L 

Number 


Hits 


Search Text 


DB 


Time stamp 




555 


raid and ((receiv$4 and transfer$4 and 


USPAT;


2004/04/15 






execut$4) with (request or command)) 


US-PGPUB; 


15:41 








EPO; JPO; 










DERWENT; 










IBM_TDB 




_ 


462 


(raid and ((receiv$4 and transfer$4 and 


USPAT; 


2003/10/27 






execut$4) with (request or command))) and 


US-PGPUB; 


10:45 






host 


EPO; JPO; 










DERWENT; 










IBM_TDB 




_ 


456 


((raid and ((receiv$4 and transfer$4 and 


USPAT; 


2003/10/29 






execut$4) with (request or command))) and 


US-PGPUB; 


17:09 






host) and storage 


EPO; JPO; 










DERWENT; 










IBM_TDB 




_ 


458 


((raid and ((receiv$4 and transfer$4 and 


USPAT; 


2003/10/29 






execut$4) with (request or command))) and 


US-PGPUB; 


17:09 






host) and storage 


EPO; JPO; 










DERWENT; 










IBM_TDB 




_ 


871 


((((receiv$4 same transfer$4 same execut$4) 


USPAT; 


2003/10/30 






same (request or command))) same host) and 


US-PGPUB; 


08:58 






storage 


EPO; JPO; 










DERWENT; 










IBM_TDB




_ 


8 


5548788. URPN. 


USPAT 


2003/10/30 










08:59 




27 


("4371929" | "4394732" | "4755928" | 


USPAT 


2003/10/30 






"4843544" | "4870643" | "4922418" | 




09:01 






"4965801" | "4975829" | "5101490" | 










"5133060" | "5148432" | "5163132" | 










"5175822" | "5175825" | "5179704" | 










"5191581" | "5206943" | "5233692" |










"5237660" | "5241630" | "5247622" | 










"5276806" | "5371855" | "5375227" | 










"5390186" | "5418925" | "5459856"). PN. 






_ 


6975 


raid 


USPAT; 


2003/10/30 








US-PGPUB; 


09:09 








EPO; JPO; 










DERWENT; 










IBM_TDB 






94 


(((((receiv$4 same transfer$4 same execut$4) 


USPAT; 


2004/03/24 






same (request or command))) same host) and 


US-PGPUB; 


17:20 






storage) and raid 


EPO; JPO; 










DERWENT; 










IBM_TDB 




_ 


20 


((raid and ((receiv$4 same transfer$4 same 


USPAT; 


2003/10/30 






execut$4) with (request or command))) same 


US-PGPUB; 


09:10






host) and storage 


EPO; JPO; 










DERWENT; 










IBM_TDB 




_ 


74 


((((((receiv$4 same transfer$4 same execut$4) 


USPAT; 


2003/11/03 






same (request or command))) same host) and 


US-PGPUB; 


13:06 






storage) and raid) not (((raid and ((receiv$4 


EPO; JPO; 








same transfer$4 same execut$4) with (request 


DERWENT; 








or command))) same host) and storage) 


IBM_TDB




- 


422 


honda-k.in. 


USPAT; 


2003/10/30 








US-PGPUB; 


09:26 








EPO; JPO; 










DERWENT; 










IBM_TDB

155


honda-kiyoshi.in.


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


2003/10/30 
09:26 




2 


(((((receiv$4 same transfer$4 same execut$4)


USPAT; 


2003/10/30 






same (request or command))) same host) and 


US-PGPUB; 


09:31 






storage) and honda-kiyoshi.in. 


EPO; JPO; 

DERWENT; 

IBM_TDB






56 


(((((receiv$4 same transfer$4 same execut$4)


USPAT; 


2003/10/30 






same (request or command))) same host) and 


US-PGPUB; 


10:28 






storage) and hitachi.as. 


EPO; JPO; 

DERWENT; 

IBM_TDB 






1514 


redundant adj array adj inexpensive


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


2003/10/30 
10:00 




1 


(redundant adj array adj inexpensive) WITH 


USPAT; 


2003/10/30 






host with (receiv$4 near3 request) 


US-PGPUB; 
EPO; JPO;
DERWENT; 
IBM_TDB 


10:08 




36 


(((((receiv$4 same transfer$4 same execut$4)


USPAT; 


2003/11/03






same (request or command))) same host same 


US-PGPUB; 


15:00 






first) and storage) and raid 


EPO; JPO; 

DERWENT; 

IBM_TDB 






6 


((plurality or multiple) adj3 (storage)) with host


USPAT; 


2003/11/03 







with receiv$4 with trans$4 with execut$4 


US-PGPUB;
EPO; JPO;
DERWENT; 
IBM_TDB 


15:22 




30 


((plurality or multiple) adj3 (storage)) with host


USPAT; 


2003/11/03 






with ((receiv$4 or trans$4 or execut$4) near5 


US-PGPUB;


15:33 






request) 


EPO; JPO; 

DERWENT; 

IBM_TDB 






10 


(plurality or multiple) with host with receiv$4


USPAT;


2003/11/03 






with trans$4 with execut$4 with request 


US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


16:46 




2011 


(disk near array).ab.


USPAT;
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB


2003/11/04

14:38 




356 


((disk near array).ab.) and host and redundant


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


2003/11/04
14:38 




201 


(disk near array same redundant). ab. 


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB


2003/11/04 
14:42 
40


(disk near array same redundant same


USPAT; 


2003/11/04 






transfer$4).ab. 


US-PGPUB; 
EPO; JPO; 
DERWENT;
IBM_TDB


15:32 


- 


18 


(disk near array same redundant same 


USPAT; 


2003/11/05 






transfer$4).ab. 


US-PGPUB;
IBM_TDB


09:16 


- 


98 


RAID1 


USPAT; 

US-PGPUB; 

IBM_TDB 


2003/11/05 
11:27 




2 


6282610. pn. 


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


2003/11/05 
12:20 




15 


(storage adj device) near5 network near5 serial


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT;
IBM_TDB


2003/11/05 
12:22 


- 


350 


(storage adj device) adj network 


USPAT; 
US-PGPUB;
IBM_TDB


2003/11/05 

12:23


- 


40 


((storage adj device) adj network ) and raid 


USPAT; 
US-PGPUB;
IBM_TDB


2003/11/05 
12:49 


- 


2705 


storage adj subsystem 


USPAT; 
US-PGPUB;
IBM_TDB


2003/11/05 
12:53 


- 


1 


6484229. pn. 


USPAT; 

US-PGPUB;

IBM_TDB 


2003/11/05 
12:54 




1 


5651132.pn.


USPAT; 

US-PGPUB; 

IBM_TDB 


2003/11/05 
13:05 




6 


("4761785" | "5353424" | "5373128" |
"5410667" | "5418921" | "5469548"). PN. 


USPAT 


2003/11/05 
12:57 


- 


4348 


(ring or loop or toreus) near network 


USPAT; 
US-PGPUB;
IBM_TDB


2003/11/05 
13:05 


- 


3688 


(ring or loop or toreus) adj network 


USPAT; 
US-PGPUB;
IBM_TDB


2003/11/05 
13:05 


- 


2945 


ring adj network 


USPAT; 
US-PGPUB;
IBM_TDB


2003/11/05 
13:06


- 


23 


(storage adj subsystem) and (ring adj network) 


USPAT; 
US-PGPUB;
IBM_TDB


2003/11/05 
13:20


- 


21835 


raid or ((plurality or many or multiple) adj 


USPAT; 


2003/11/05 






(storage or disk))


US-PGPUB;
IBM_TDB






78 


(ring adj network) and (raid or ((plurality or 


USPAT; 


2003/11/05 






many or multiple) adj (storage or disk)))


US-PGPUB;
IBM_TDB


13:21 




66 


((ring adj network) and (raid or ((plurality or 


USPAT; 


2003/11/05 






many or multiple) adj (storage or disk)))) NOT 


US-PGPUB; 


13:56 






((storage adj subsystem) and (ring adj 


IBM_TDB 








network)) 








30 


"store and forward" 


USPAT; 

US-PGPUB; 

IBM_TDB 


2003/11/05 
13:58 
4


(serial adj raid).ab.


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


2003/11/05 
16:19 




2 


"20020178328" 


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


2003/11/06 
09:39 




5 


(serial near5 raid).ab. 


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


2003/11/05 
16:35 




1 


((serial near5 raid).ab.) not ((serial adj


USPAT; 


2003/11/05 






raid).ab.) 


US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


16:36 




42 


serial near5 raid 



USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


2003/11/05 
16:35 




2152 


ssa 


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


2003/11/05 
16:37 




85 


ssa same raid 



USPAT; 
US-PGPUB; 
EPO; JPO;
DERWENT;
IBM_TDB


2003/11/05 
16:37 


- 


0 


1260904. URPN. 


USPAT 


2003/11/06 
09:14 




1


"20020178328" and ((cooperation adj control


USPAT; 


2003/11/06 






adj information) near20 (first adj identification 


US-PGPUB; 


12:40 






adj information)) 


EPO; JPO; 

DERWENT; 

IBM_TDB 






4126 


daisy near chain


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB


2003/11/06 


12:40 




4125 


daisy adj chain


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


2003/11/06 
12:40 




19 


(daisy adj chain) near20 raid


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


2003/11/06 
12:41 




2 


5768623. pn. 


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB


2003/11/06 
17:39 
2


"20020178328" 


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


2004/03/24 
15:45 




11440 


(raid or (array near disk)) 


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


2004/03/24 
17:21 




2831 


((raid or (array near disk))) and serial$3 


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


2004/03/24 
17:21 




114 


(((read or write) near3 request) same parit$3) 


USPAT; 


2004/03/24 






and (((raid or (array near disk))) and serial$3) 


US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


17:29 




47 


((((read or write) near3 request) same parit$3) 


USPAT; 


2004/03/24 






and (((raid or (array near disk))) and serial$3)) 


US-PGPUB; 


17:23 






and 711/114.ccls.


EPO; JPO; 

DERWENT; 

IBM_TDB 






7 


(((read or write) near3 request) same parit$3 


USPAT; 


2004/03/24 






same serial$3) and (((raid or (array near disk)))


US-PGPUB; 


17:48 






and serial$3) 


EPO; JPO; 

DERWENT; 

IBM_TDB 






0 


(((read or write) near3 request) same parit$3 


USPAT; 


2004/03/25 






same (serial$3 near connect$3)) and (((raid or 


US-PGPUB; 


10:22 






(array near disk)))) 


EPO; JPO; 

DERWENT; 

IBM_TDB 






34 


(raid or (redundant adj array near disks)) and 


USPAT; 


2004/04/15 






(disk$2 near4 serially) 


US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


15:43 




3 


(raid or (redundant adj array near disks)) and 


USPAT; 


2004/04/15 






(serially near connected near3 disk) 


US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB


15:43 




2 


5835694. pn. 


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB


2004/04/19 
12:47 




657 


raid and ((creat$3 or mak$3 or generat$3) near


USPAT; 


2004/04/21 







parity) 


US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB


09:48 




442 


raid and (((creat$3 or mak$3 or generat$3) 


USPAT; 


2004/04/21 






near parity) same writ$3) 


US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB


15:31 
66


raid and (((creat$3 or mak$3 or generat$3) 


USPAT; 


2004/04/21 






near parity) same (writ$3 near request)) 


US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB 


11:34 




0 


"data to generate parity" 


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB


2004/04/21 
14:16 




75 


raid and (((creat$3 or mak$3 or generat$3) 


USPAT; 


2004/04/21 






near parity) same writ$3 same fail$4) 


US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB


15:31 




2 


raid same synchronously same request 


USPAT; 
US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB


2004/04/21 
16:04 




80 


raid and ((stor$3 or sav$3 or cop$3) near3 


USPAT; 


2004/04/21 







synchronously) 


US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB


16:05 




14 


raid and ((stor$3 or sav$3 or cop$3) near3 


USPAT; 


2004/04/21 






synchronously same request) 


US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB


17:24 




122 


raid and ((stor$3 or sav$3 or cop$3 or writ$3) 


USPAT; 


2004/04/21 






near3 synchronously) 


US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB


17:24 




33 


raid and ((stor$3 or sav$3 or cop$3 or writ$3) 


USPAT; 


2004/04/21 






near3 synchronously same request) 


US-PGPUB; 
EPO; JPO; 
DERWENT; 
IBM_TDB


17:24 
[57] ABSTRACT

A fault-tolerant high performance mirrored disk subsystem is described which has an improved disk writing scheme that provides high throughput for random disk writes and at the same time guarantees high performance for disk reads. The subsystem also has an improved recovery mechanism which provides fast recovery in the event that one of the mirrored disks fails and during such recovery provides the same performance as during non-recovery periods.

Data blocks or pages which are to be written to disk are temporarily accumulated and sorted (or scheduled) into an order (or schedule) which can be written to disk efficiently, which in a preferred embodiment is in accordance with the physical location on disk at which each such block will be written. This also generally corresponds to an order which is encountered by a write head during a physical scan of a disk. The disks of a mirrored pair are operated out of phase with each other, so that one will be in read mode while the other is in write mode. Updated blocks are written out to the disk that is in write mode in sorted order, while guaranteed read performance is provided by the other disk that is in read mode. When a batch of updates has been applied to one disk of a mirrored pair, the mirrored pair switch their modes, and the other disk is updated. Preferably the updates are kept in a non-volatile memory, which furthermore advantageously may be made fault-tolerant as well.

During recovery a pair of spare alternating mirrored disks is introduced to which new updates are directed, while a background scan process copies data from the surviving disk to the new mirrored pair.
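
The alternating scheme summarized in the abstract can be illustrated with a small sketch. This is only a toy model under stated assumptions, not the patented controller: the names `Disk`, `MirroredPair`, `write`, `read` and `apply_batch` are hypothetical, an ordinary dictionary stands in for the non-volatile memory, and integer locations stand in for physical positions on disk.

```python
from dataclasses import dataclass, field


@dataclass
class Disk:
    name: str
    mode: str = "read"                           # "read" or "write"
    blocks: dict = field(default_factory=dict)   # physical location -> data


class MirroredPair:
    """Toy model of one mirrored pair with deferred, alternating updates."""

    def __init__(self):
        self.a = Disk("a")
        self.b = Disk("b")
        # Stand-in for the non-volatile memory (an assumption of this sketch):
        # location -> [data, names of the disks that still have to write it]
        self.pending = {}

    def write(self, location, data):
        # Accumulate the update instead of writing it to disk immediately.
        self.pending[location] = [data, {"a", "b"}]

    def read(self, location):
        # Serve from memory if buffered, otherwise from a disk in read mode.
        if location in self.pending:
            return self.pending[location][0]
        reader = self.a if self.a.mode == "read" else self.b
        return reader.blocks.get(location)

    def apply_batch(self, disk):
        """Write every update this disk still owes, in physical-location order."""
        other = self.b if disk is self.a else self.a
        disk.mode, other.mode = "write", "read"
        written = []
        for location in sorted(self.pending):
            data, owed_by = self.pending[location]
            if disk.name in owed_by:
                disk.blocks[location] = data
                owed_by.discard(disk.name)
                written.append(location)
        # Forget updates that both disks have now applied.
        done = [loc for loc, (_, owed) in self.pending.items() if not owed]
        for location in done:
            del self.pending[location]
        disk.mode = "read"
        return written


pair = MirroredPair()
for location, data in [(40, "x"), (7, "y"), (19, "z")]:
    pair.write(location, data)
print(pair.apply_batch(pair.a))   # [7, 19, 40]
print(pair.apply_batch(pair.b))   # [7, 19, 40]
```

Random writes arrive in the order 40, 7, 19 but each disk receives them as one sorted sweep, and an update is dropped from memory only after both disks have applied it.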
DIGITAL STORAGE SYSTEM AND METHOD HAVING ALTERNATING DEFERRED UPDATING OF MIRRORED STORAGE DISKS

FIELD OF THE INVENTION

This invention relates generally to fault tolerant digital storage disk systems and more particularly to digital storage disk systems of the mirrored disk type in which reliability is provided by storing digital information in duplicate on two separate storage disks.

BACKGROUND OF THE INVENTION

As the requirements for On-Line Database Transaction Processing (OLTP) grow, high transaction rates on the order of thousands of transactions per second must be supported by OLTP systems. Furthermore, these applications call for high availability and fault tolerance. [illegible] the requests are random accesses to data. Since a large fraction of the data resides on disks, the disk sub-systems must therefore support a high rate of random accesses, on the order of several thousands of random accesses per second. Furthermore, the disks need to be fault tolerant to meet the availability needs of OLTP.

Whenever a random access is made to a disk, in general the disk must rotate to a new orientation such that the desired data is under a disk arm and the read/write head on that disk arm must also move along the arm to a new radial position at which the desired data is under the read/write head. Unfortunately performance of this physical operation, and therefore random disk Input/Output (I/O) performance, is not improving as fast as other system parameters such as CPU MIPS. Therefore, applications such as OLTP, where random access to data predominates, have become limited by this factor, which is referred to in this art as being disk arm bound. In systems which are disk arm bound, the disk cost is becoming an ever larger fraction of the system cost. Thus, there is a need for a disk sub-system which can support a larger rate of random accesses per second with a better price-performance characteristic than is provided by traditional disk systems.

Both mirrored disk systems and RAID disk systems (for Redundant Array of Independent Disks) have been [illegible] duplicated on a second (and therefore redundant) disk. In a RAID array, the information at corresponding block locations on several disks is used to create a parity block on another disk. In the event of failure, any one of the disks in a RAID array can be reconstructed from the others in the array. RAID architectures require less disks for a specified storage capacity, but mirrored disks generally perform better. In an article entitled "An evaluation of redundant arrays of disks using an Amdahl 5890," SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pp. 74-85, Boulder, Colo., May 1990, P. Chen et al. showed that mirrored disks are better than RAID-5 disk arrays for workloads with predominantly random writes (i.e., average read/write times for mirrored disk architectures are lower than for RAID-5 architectures when random read/writes predominate). RAID-5 architecture is described, for example, by D. Patterson et al. in "A case for redundant arrays of inexpensive disks," ACM SIGMOD Int'l Conf. on Management of Data, pp. 109-116, Chicago, Ill. (June 1988). However, mirrored disks do require that each data write be written on both disks in a mirrored pair. Thus, it is generally accepted that mirrored disk storage systems impose a performance penalty in order to provide the fault tolerance.

In a pending patent application Ser. No. 8-036636 filed Mar. 24, 1993, assigned to the same assignee as this patent application, entitled "Disk Storage Method and Apparatus for Converting Random Writes To Sequential Writes While Retaining Physical Clustering on Disk", some of the inventors of the present invention disclosed a method for improving the performance of a single disk or a RAID array. This is done by building sorted runs of disk writes in memory, writing them to a log disk, merging the sorted runs from the log disk and applying them in one pass through the data disks with [illegible] batch writes [illegible] method has the advantage of converting random writes into sequential writes. One problem with this approach, however, is that when [illegible] requests are delayed while the batch is written, leading to a penalty in disk read response time, or the batch writes are interrupted by the read, leading to a loss in write (and therefore overall) throughput. Either way the overall performance suffers so that the [illegible] of creating sorted runs is offset largely whenever random disk reads are needed frequently during batch write operations.

The traditional method for recovery in a mirrored disk system is to copy the data from the surviving disk of the mirrored pair onto a spare backup disk. This is typically done by scanning the data on the surviving disk, and applying any writes that come in during this process to both disks. One problem with this approach is that it produces a significant degradation of the disk system performance during recovery.

SUMMARY OF THE INVENTION

Accordingly, it is an object of this invention to improve the performance of mirrored disk systems by largely eliminating the penalty normally resulting from the need to duplicate each disk write onto both disks of a mirrored pair of disks.

It is also an object to provide a disk subsystem that has improved performance for random disk I/O by converting random disk write I/O to close to sequential I/O.

It is another object to improve the mirrored disk throughput without a penalty in read response time.

It is [illegible] object to [illegible] the recovery process from a failed disk, by providing guaranteed performance to disk reads and writes during recovery, while retaining fast recovery.

These and further objects and advantages are achieved by this invention by providing a fault-tolerant disk storage subsystem of the mirrored disk type in which updates (i.e., data blocks to be written) to disk are accumulated and scheduled into successive batch runs of updates, the scheduling being done to produce an ordering which can be written efficiently to the mirrored disks. The updates preferably, but not necessarily, are accumulated in a memory in the disk controller, and the scheduling is preferably, but not necessarily, done by the disk controller for the mirrored disks. Preferably the memory is either non-volatile or fault-tolerant.
In a preferred embodiment, the scheduling is done by sorting the updates in accordance with the home locations of the updates on the mirrored disks (i.e., in accordance with the positions on disk at which the updates will be written). This is an ordering which also corresponds to a scan of a disk.

The disks in each mirrored pair are then operated out of phase with each other, one being in read mode while the other is in write mode. A batch of writes is efficiently applied each time to the disk in write mode in accordance with the scheduled order. Because the updates are copied onto each disk of the mirrored pair in accordance with the physical order on the disk, good performance is achieved for applying the writes.

Random writes are thus converted to largely sequential writes to disk and clustering of data on the disk is preserved. The average time to apply a write of a block using this method is typically less than half the time to apply a random write of a block on the disk, thus largely eliminating the problem of having to write a block twice to a pair of mirrored disks.

During this time, read requests are either handled by reading the data from the memory or from the disk that is in read mode. Thus, guaranteed performance to read requests is also achieved. When a batch of updates has been applied to one of the disks of a mirrored pair (i.e., the one in write mode) the disks switch modes of operation. There may also be a period of time when both disks are in read mode between these two modes of operation. Also there may be times when both disks of a mirrored pair are in write mode, as for example during loads and other large copying operations.

Recovery from failure of one disk of a mirrored pair of disks is handled by introducing a pair of spare mirrored disks that are operated using the alternating mirrors scheme. During recovery, new writes are directed to the spare disk pair. Reads are either handled from the surviving disk or the alternating mirror spare disk pair. In the background, spare cycles are used to scan and copy data from the surviving disk to the spare alternating mirror pair. This method provides fast recovery with guaranteed performance to both read and write requests during recovery.

BRIEF DESCRIPTION OF THE DRAWINGS

These, and further, objects, advantages, and features of the invention will be more apparent from the following detailed description of a preferred embodiment and the appended drawings in which:

FIG. 1 is an overall block diagram of a preferred embodiment of this invention;
FIG. 2 illustrates a preferred organization of data in the non-volatile memory of the I/O processor;
FIG. 3 is a flow chart which shows the steps involved in processing a write request during normal operation;
FIG. 4 is a flow chart which shows the steps involved in the process of applying a batch of writes to a disk during normal operation;
FIG. 5 is a timing diagram which shows the timing relation between the two processes applying write batches to two mirrored disks;
FIG. 6 is a flow chart which shows the steps involved in processing a read request during normal operation;
FIG. 7 schematically illustrates the configuration during recovery of a failed disk;
FIG. 8 is a flow chart which shows the steps involved in servicing a read request during recovery;
FIG. 9 is a flow chart which shows the steps involved in the background process that scans the survivor disk.

DESCRIPTION OF A PREFERRED EMBODIMENT

FIG. 1 is a block diagram of a preferred embodiment of a computer system which incorporates a disk storage subsystem having mirrored storage disks which are alternately updated in a batch fashion with accumulated updates that have been sorted for efficient writing (henceforth sometimes called AMDU for Alternating Mirrors with Deferred Updates) in accordance with this invention. It includes a controller or I/O processor (IOP) 200, a plurality of mirrored disk pairs 300-1 through 300-N, at least one spare pair of disks 400 and a central processing unit (CPU) 100. The controller 200 is connected to the CPU 100 and has a processor 210 and non-volatile memory 220. For simplicity we assume the non-volatile memory is partitioned into regions 220-1 through 220-N, each corresponding to a mirrored pair.

Those skilled in the art will readily appreciate that the memory and controller need not constitute a separate physical subsystem as illustrated, but could instead be implemented with software running in the main computer system. [illegible] in order to achieve useful benefit from this invention [illegible] environments could be fault-tolerant as [illegible] and a voting mechanism). The memory also need not be partitioned and there may be more than one spare pair of disks. The spare pair of disks is not used in normal operation, but is used in the event one of the mirrored disks fails.

Each mirrored disk pair consists of two disks labeled 300-a1 and 300-b1 for disk pair 300-1 (correspondingly 300-aN and 300-bN for disk pair 300-N). The two disks in a mirrored pair contain basically identical data. However, as will be better appreciated from the following description, updates to each mirrored pair of disks are NOT made simultaneously, as would be the case with conventional mirrored disks. In accordance with the invention, updates are accumulated instead in the non-volatile memory 220 and sorted into batches of updates, which are applied to the two disks of a pair not simultaneously, but rather first to one and then to the other. Furthermore, while the same updates are eventually applied to each of the two disks of a pair (except for updates that have become obsolete because they have been further updated in non-volatile memory 220 before being applied to both disks), the batches of updates made to each disk individually generally are not identical since they are applied at different times and more recent blocks of data are included in the individual batch for each disk of a pair.

FIG. 2 shows one region of non-volatile memory 220-i in more detail. The region has a number of data blocks labeled 221-1 through 221-k. Corresponding to each data block are two tags labeled 222-1 and 223-1 (for block 221-1) through 222-k and 223-k (for block 221-k). The two tags for each data block correspond to the two disks in the mirrored pair and indicate whether the corresponding disk must still write the data block.

In each region of the non-volatile memory there is also a list of pointers labeled 225-1 through 225-L. Each pointer points to a data block in some region of non-volatile memory. The order of the pointers in the list indicates the order in which the blocks should be written on the disk to achieve efficiency.
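
The memory region described for FIG. 2 (data blocks, two tags per block, and an ordered pointer list) might be modeled roughly as below. This is a minimal sketch under assumptions, not the layout used in the patent: the names `CachedBlock`, `Region`, `add` and `pending_for` are hypothetical, and sorting by an integer home location is just one simple way to obtain an order that is efficient to write.

```python
from bisect import insort
from dataclasses import dataclass, field


@dataclass
class CachedBlock:
    data: bytes
    tag_a: bool = True   # True while disk a must still write this block
    tag_b: bool = True   # True while disk b must still write this block


@dataclass
class Region:
    """One memory region, serving one mirrored pair (hypothetical model)."""
    blocks: dict = field(default_factory=dict)     # home location -> CachedBlock
    pointers: list = field(default_factory=list)   # home locations in write order

    def add(self, home, data):
        # A new block gets a pointer at its sorted position; an existing block
        # is overwritten in place and both of its tags are turned back on.
        if home not in self.blocks:
            insort(self.pointers, home)
        self.blocks[home] = CachedBlock(data)

    def pending_for(self, disk):
        # Blocks that the given disk ("a" or "b") must still write, in list order.
        tag = "tag_a" if disk == "a" else "tag_b"
        return [h for h in self.pointers if getattr(self.blocks[h], tag)]


region = Region()
for home in (300, 12, 150):
    region.add(home, b"payload")
region.blocks[150].tag_a = False     # disk a has already written block 150
print(region.pending_for("a"))       # [12, 300]
print(region.pending_for("b"))       # [12, 150, 300]
```

Keeping one tag per disk is what lets the two disks work through the same pointer list at different times.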
The non-volatile memory acts as a cache for data blocks to be written to disk. Those skilled in the art will readily appreciate that this cache may be managed like any other cache, such as by using a hash table (to determine which blocks are present in the cache) and a free list (to chain free blocks). The hash table and the free list are not shown in FIG. 2.

In the preferred subsystem in accordance with this invention, there are four processes occurring during normal operation. The first process services write requests and is shown in FIG. 3. When a write request arrives (block 510), the non-volatile memory in the controller is checked for the presence of an old version of that block (block 520 in FIG. 3). If the previous version of that block is already in the non-volatile memory region for the corresponding pair of disks, the previous version in memory is overwritten (block 530) and both tags corresponding to that block are turned on (block 560) to indicate that both disks must install the new version of that block.

If an old version of the block to be written is not found in the non-volatile memory, the controller looks for free space in its non-volatile memory in which to temporarily store the block to be written (block 540). If there is a free space for the block, the data block defined by the write request is written into the free space and a pointer to the new block is inserted into the pointer list (block 550) in a position (relative to other pointers in the list) such that the list represents a schedule for efficiently writing the pointed-to data blocks to disk. The corresponding tags are turned on to indicate that both disks must perform a write of the new data block (block 560).

If the data block to be written is not in the non-volatile memory and there is no free block space available, the new block is written synchronously to both disks (block 570). This situation will not occur normally (other than maybe for loads and other very large copying operations) if the non-volatile memory is large enough to absorb heavy bursts of write activity, but the action to take in the event a write request is encountered and there is no free space in the non-volatile memory must be specified anyway.

The list of pointers 225 defines an order or schedule for the data blocks (in the non-volatile memory) covered by that list. This order or schedule is created preferably such that if a disk accesses the blocks in that order, the total time to access all the blocks (and write them to disk) will be minimized. As a first approximation, the ordering may be by cylinder, so that a scan (sweep) through the entire disk can apply all updates in one pass. All blocks in a particular cylinder are written before moving on to the next cylinder. More elaborate schemes could order the blocks within a cylinder to minimize the rotation latency for the cylinder. More sophisticated schemes may take into account the combination of seek time and rotational delay. Such schemes are described, for example, by M. Seltzer et al. in "Disk

reached the end of the list (block 620). If so, it goes to sleep (block 680) and wakes up for the next period. If there are more blocks, it checks the tag of the current block in the pointer list (block 630). If the tag corresponding to the process's disk is off, the process moves to the next pointer in the list (block 670). If the tag is on, the process writes the block on disk and turns the tag off (block 640). After turning the tag off, it checks the other tag (block 650). If the other tag is on, the process moves to the next pointer in the list (block 670). If the other tag is off, both disks have applied the update, so the pointer is removed from the list (block 660). The block is still valid (a read to that block will still get a cache hit), but it is free and can be overwritten by a subsequent write of any block. Then the process moves to the next pointer in the list (block 670).

The two processes applying the updates to the two disks have the same logic and preferably the same period, i.e., the time that elapses between two consecutive activations of the write phase of the process preferably is the same for the two processes. This period is called T in the described embodiment. The two processes need not be synchronized, but they are illustrated at a phase difference of 180 degrees in FIG. 5. The process for disk a wakes up at times 0, T, 2T, 3T etc., and begins to write a batch of updates to disk a until completed and then switches to read mode, while the process for disk b wakes up at times T/2, T+T/2, 2T+T/2, 3T+T/2, etc., and begins to write a batch of updates to disk b until completed and then switches to read mode. FIG. 5 illustrates this with a timing diagram, where the high value indicates time periods during which the writing process is active for the corresponding disk. While the writing process is asleep (inactive), the corresponding disk may service random read traffic.

As illustrated in FIG. 5, this results in three different controller modes, namely controller mode 1 where disk a is in write mode and disk b is in read mode, controller mode 2 where disk b is in write mode and disk a is in read mode, and controller mode 3 where both disk a and disk b are in read mode. As mentioned earlier in connection with the description of block 570 of FIG. 3, there is also a controller mode 4 where both disk a and disk b are in write mode. Controller mode 4 cannot occur so long as the batch write completes each time in less than time T/2. The system is preferably designed so that the situation where both disks are in write mode simultaneously is largely avoided, which is done by making the design such that batch writes will complete in less than time T/2.

Keeping the two processes at phase difference of 180 degrees ensures that if writes can be applied in less than half a period, there is always one disk arm dedicated to servicing random reads, which allows batches to become large (so that writes can gain efficiency), without hurting response time for reads. The period T is a system-dependent parameter, primarily determined by the amount of memory available, since the writes accumu-

Scheduling Revisited", Winter 1990 USENDC, pp. lating in a period T should fit in memory. 

313-323, Washington, D.C. (Jan. 1990). 60 The logic for the process servicing read requests is 

There are two processes (one for each disk), which shown in FIG. 6. When a read request arrives (block 

periodically wake up and apply the updates pending in 810), the non-volatile memory is checked (block 820) 

non-volatile memory to the corresponding disk. The for the presence of the block to be read. If the block is 

logic for applying the updates is identical for the two in memory, it is returned immediately (block 830). If it 

processes and is shown in FIG. 4. When the process 65 is not in memory, a check is made (block 840) as to 

wakes up (block 610), it goes to the beginning of the whether both disks are currently servicing read requests 

pointer list and traverses the pointer list examining each (Le., whether the controller is in controller mode 3). If 

block in order. In each step it checks to see if it has not, the request is served by the disk that is currently in 
read-only mode (block 850), i.e., the disk whose write process is inactive. If both disks are in read-only mode (i.e., controller mode 3), the request may be serviced by either disk, but preferably will be serviced by the disk whose heads are closest to the target block (block 860).

Those skilled in the art will readily appreciate that some routine synchronization (e.g., latching) is required to preserve the integrity of the shared data structures (e.g., tags, pointer list) accessed simultaneously by more than one of the above processes. Also, the pointer order that minimizes the time to write a batch may be different for the two disks, since each disk generally writes to a disk a different subset of the blocks stored in the non-volatile memory.

Operation under a failure scenario will now be described and is illustrated in FIG. 7. Assume that disk 300-bg in mirrored pair 300-g fails. Traditional recovery schemes use one replacement disk, onto which the contents of the surviving disk 300-ag are copied to replace the lost mirrored disk and therefore restore the mirrored pair. The preferred recovery scheme in accordance with this invention utilizes a pair of replacement [...] to the system for other use.

For the duration of recovery, the survivor stays in read-only mode. The survivor does not get involved in servicing writes. Those skilled in the art will readily appreciate that a bit map (labeled 230 in FIG. 7) stored in the non-volatile memory can be used to keep track of which blocks remain to be retrieved from the survivor before recovery completes. The bit map has one bit per disk block and all bits are clear when recovery starts. Alternatively, the bit map can be stored in other memory components of the system.

In total, there are five processes involved in recovery. Two of the processes (one for each disk) periodically wake up and apply writes pending in the cache to the corresponding disk. The processes have the same period and are maintained at a phase difference of 180 degrees. The logic of these processes is identical to that for normal operation shown in FIG. 4. The third process services reads and is shown in FIG. 8. When a read request arrives (block 910), the memory is checked (block 920) for the presence of the block to be read. If the block is present in memory, it is returned immediately (block 930). If the block is not in the non-volatile memory, it must be read from disk. The bit map is first checked to see if the block is available on the replacement disks (block 940). If not, the read is serviced by the survivor (block 960). After the block is read from the survivor, the process checks if there is free space in the non-volatile memory (block 970). If not, the process ends. If there is free space in the non-volatile memory, the block is also placed in the non-volatile memory, a pointer is inserted in the list and both tags are turned on (block 980), so that the disks will write it in their next write phase. Furthermore, the bit in the bit map corresponding to that block is turned on to indicate that there is no longer a need to extract that block from the survivor.

If on a read request the block is not in non-volatile memory and the bit map shows that the block is available on the replacement disks, the block is read from one of the replacements. The process preferably checks how many replacements are in read mode at that moment (block 950). If only one replacement is in read mode, the request is served by that replacement (block 954). If both replacements are in read mode, the request preferably is serviced by the disk whose arm is closest to the requested block (block 952). If the block is read from a replacement disk, there is no need to update the bit map or store the block in the cache.

The fourth process services writes during recovery and involves exactly the same steps shown in FIG. 3. In addition, in all cases, the bit corresponding to the block written is set in the bit map (if not already set) to indicate that it is no longer necessary to copy that block from the survivor.

The fifth process (shown in FIG. 9) is a background process that scans the blocks in the survivor that have not yet been written on the replacements. The process is started (block 1000) when the system enters the recovery mode and the replacement disks are activated. The process waits until the survivor becomes idle (block 1010), i.e., there are no random read requests pending for it. Then it checks if there are unscanned blocks (block 1020), i.e., blocks for which the bit in the bit map is off. If all blocks have been scanned, recovery is complete and the process terminates (block 1060); the survivor can be [...]. If there are unscanned blocks, the process checks if there is free space in the non-volatile memory (block 1030). If not, it goes to sleep (block 1040) for a certain interval. If there is free space in non-volatile memory, the process reads the unscanned block which is closest to the current position of the survivor's arm (block 1050). The bit map is used to determine which blocks are unscanned. The block read is placed in the non-volatile memory, and both of its tags are turned on to indicate that the replacements must write the block. A pointer is also inserted in the pointer list. Furthermore, the corresponding bit in the bit map is set to indicate that the survivor does not need to scan that block. The process then repeats the above steps (goes to block 1010). If a random read arrives, the process is suspended in block 1010 until the read completes.

Those skilled in the art will readily appreciate that there are opportunistic strategies which the survivor can use to further expedite the recovery process. For example, whenever the survivor disk services a random read request, it could also read any unscanned (i.e., uncopied) blocks that happen to pass under its arm while it is waiting for the disk to rotate to the targeted block. Furthermore, the process shown in FIG. 9 obviously could read more than one block at a time.

We claim:

1. A fault-tolerant disk storage subsystem for storing data blocks of digital information for a computer system, comprising:
a mirrored pair of disks for storing data blocks of digital information in duplicate on both disks of said mirrored pair; and
a controller for said mirrored pair of disks, said controller having a memory, said controller comprising:
means for temporarily accumulating in said memory until storage thereof in duplicate on both disks of said mirrored pair a multiplicity of data blocks provided by said computer system as separate writes to the disk storage subsystem;
means for identifying each block stored in said memory that has not yet been stored on one disk of said pair and for identifying each block stored in said
memory that has not yet been stored on the other disk of said pair;
means for sorting said accumulated data blocks that have not yet been stored on said one disk into an order that can be efficiently written onto said one disk in a batch run and for sorting said accumulated data blocks that have not yet been stored on said other disk into an order that can be efficiently written onto said other disk in another batch run;
means for providing a first mode of operation in which said one disk is in a write-only mode and said sorted accumulated data blocks that have not been stored on said one disk are written in batch mode onto said one disk, while said other disk serves said computer system in a read-only mode and writes from said computer system are received into said memory;
means for providing a second mode of operation in which said other disk is in a write-only mode and said sorted accumulated data blocks that have not been stored on said other disk are written in batch mode onto said other disk without interruption, while said one disk serves said computer system in a read-only mode and writes from said computer system are received into said memory;
means for operating said mirrored pair of disks in said first mode of operation during spaced time intervals and in said second mode of operation during at least a portion of the time between said spaced time intervals; and
means for providing a requested data block to said computer system from said memory if said requested data block is in said memory, and otherwise from said other disk if said mirrored pair of disks is operating in said first mode of operation and from said one disk if said mirrored pair of disks is operating in said second mode of operation,
whereby data blocks are written onto said mirrored pair of disks in sorted order in batched runs without interference from or with the reading of data blocks requested by said computer system.

2. A fault-tolerant disk storage subsystem as defined in claim 1 wherein said controller is implemented by software in said computer system and said memory of said controller is a portion of the general storage resources of said computer system.

3. A fault-tolerant disk storage subsystem as defined in claim 1 wherein said controller and memory are implemented with dedicated hardware.

4. A fault-tolerant disk storage subsystem as defined in claim 1 wherein said memory is non-volatile.

5. A fault-tolerant disk storage subsystem as defined in claim 1 wherein said memory is fault-tolerant.

6. A fault-tolerant disk storage subsystem as defined in claim 1 wherein said means for operating said mirrored pair of disks in said first mode of operation schedules said first mode of operation to start periodically.

7. A fault-tolerant disk storage subsystem as defined in claim 1 wherein said means for sorting said accumulated data blocks sorts said data blocks into an order which corresponds to a physical scan of an entire disk.

8. A fault-tolerant disk storage subsystem as defined in claim 1 wherein said controller includes means for making a single scan through said accumulated data blocks that have not yet been stored on said one disk in sorted order during said first mode of operation before changing modes.

9. A fault-tolerant disk storage subsystem as defined in claim 1 wherein said controller further includes means for providing a third mode of operation during which both of said disks of said mirrored pair are in read mode and a requested data block may be retrieved from either of said disks of said mirrored pair in the event said requested data block is not in said memory.

10. A fault-tolerant disk storage subsystem as defined in claim 1 wherein said controller further includes means for operating said mirrored pair of disks in said third mode of operation whenever said first mode of operation is terminated prior to a next scheduled start of said second mode of operation and whenever said second mode of operation is terminated prior to a next scheduled start of said first mode of operation.

11. A fault-tolerant disk storage subsystem as defined in claim 9 wherein said controller includes means for determining during said third mode of operation, when a block is requested by said computer system and is not present in said memory, which one of said disks in said mirrored pair can deliver said requested data block in [...] time and means for retrieving said requested data block from said determined disk.

12. A fault-tolerant disk storage subsystem as defined in claim 1 wherein said controller further includes means for providing a third mode of operation during which both of said disks of said mirrored pair are operating in write-only mode.

13. A fault-tolerant disk storage subsystem as defined in claim [...] wherein said controller further includes means for operating said mirrored pair of disks in said third mode of operation whenever all accumulated data blocks that have not been stored on said one disk at the start of said first mode of operation have not been written to said one disk by the time said second mode of operation is scheduled to start again and whenever all accumulated data blocks that have not been stored on said other disk at the start of said second mode of operation have not been written to said other disk by the time said first mode of operation is scheduled to start again.

14. A fault-tolerant disk storage subsystem as defined in claim 1 wherein said subsystem includes a spare pair of storage disks and said controller includes means for augmenting said mirrored pair of storage disks with said spare pair of storage disks during a recovery mode of operation in the event that only one of said disks of said mirrored pair remains operational.

15. A fault-tolerant disk storage subsystem as defined in claim 14 wherein said controller includes means for placing said disk of said mirrored pair which remains operational in read-only mode continuously during said recovery mode of operation until all blocks on said remaining operational disk have been transferred either to said memory or to one or both of the disks of said spare pair, and means for replacing said mirrored pair with said spare pair when all blocks on said remaining operational disk of said mirrored pair have been transferred.

16. A method of storing data blocks of digital information received from a computer system in a storage subsystem having a mirrored pair of storage disks and for retrieving data blocks from said storage subsystem upon request from said computer system, comprising the steps of:
temporarily accumulating a group of data blocks received from said computer system in the form of separate writes as batches of data blocks to be stored;
sorting said accumulated data blocks in each batch in an order for efficient batch writing said mirrored pair of disks;
operating said mirrored pair of disks in a first mode of operation in which one disk of said mirrored pair is in write-only mode while the other disk of said mirrored pair is in read-only mode and in a second mode of operation in which said one disk is in read-only mode while said other disk is in write-only mode;
copying onto said one disk during said first mode of operation a batch of accumulated and sorted data blocks in said accumulated group that have not been already written to said one disk;
copying onto said other disk during said second mode of operation a batch of accumulated and sorted data blocks in said accumulated group that have not been already written to said other disk;
operating said mirrored pair of disks in said first mode of operation during spaced time intervals and in said second mode of operation during at least a portion of the time between said spaced time intervals; and
retrieving a data block requested by said computer system from said accumulated group of data blocks if said requested data block is in said accumulated group, and otherwise from said other disk if said mirrored pair of disks is operating in said first mode of operation and from said one disk if said mirrored pair of disks is operating in said second mode of operation,
whereby data blocks are written onto said mirrored pair of disks in sorted order without interference from or with the reading of data blocks requested by said computer system.

17. A method of storing data blocks of digital information as defined in claim 16 and further comprising the step of deleting from said accumulated group any data blocks that have been written to both of said disks of said mirrored pair.

18. A method as defined in claim 17 and further comprising the step of associating first and second flags with each accumulated data block in said group, said first flag associated with any particular data group indicating whether or not said particular data group has been copied to said one disk and said second flag associated with said particular data group indicating whether or not said particular data group has been copied to said other disk.

*****
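As a rough illustration of the timing described in the specification above (write processes with a common period T, offset by T/2, giving controller modes 1 to 4), the following minimal sketch maps a point in time to a controller mode and to the source that would serve a read. The function names and the assumption of a fixed batch duration per disk are hypothetical and introduced only for this example; the text itself lets each batch run "until completed" and does not state how reads are served in mode 4.

```python
def write_window_open(t, period, phase, batch_time):
    """Return True while one disk's write process is active at time t.

    The process wakes at phase, phase + period, ... and is assumed
    (for this sketch only) to write for a fixed batch_time.
    """
    return (t - phase) % period < batch_time


def controller_mode(t, period, batch_a, batch_b):
    """Map time t to controller mode 1-4 as the modes are named in the text."""
    a_writing = write_window_open(t, period, 0.0, batch_a)         # disk a wakes at 0, T, 2T, ...
    b_writing = write_window_open(t, period, period / 2, batch_b)  # disk b wakes at T/2, T+T/2, ...
    if a_writing and b_writing:
        return 4  # both disks writing; only possible if a batch takes longer than T/2
    if a_writing:
        return 1  # disk a writes, disk b serves reads
    if b_writing:
        return 2  # disk b writes, disk a serves reads
    return 3      # both disks serve reads


def read_source(mode, in_memory):
    """Pick where a read is served from, following the FIG. 6 description."""
    if in_memory:
        return "memory"
    if mode == 1:
        return "disk b"
    if mode == 2:
        return "disk a"
    if mode == 3:
        return "either"  # the text prefers the disk whose heads are closest
    return None          # mode 4: not specified in the text


if __name__ == "__main__":
    for t in (1, 4, 6, 9.5):
        mode = controller_mode(t, period=10, batch_a=3, batch_b=4)
        print(t, mode, read_source(mode, in_memory=False))
```

With both batch times below T/2, mode 4 never appears in the output, which matches the design goal stated in the specification.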
ABSTRACT



When new data for writing is sent from a host device, old 
data and old parities are read after a search time respectively, 
and a new parity is generated with the new data, the old data 
and the old parities, and the new parity is stored in a cache 
memory, and when the number of the new parities corre- 
sponding to a plurality of write data becomes more than a 
predetermined value set by a user or when there is a period 
of time in which no read request or no write request is 
issued, new parities are collectively written to a drive for 
storing parities. In this case, a plurality of new parities are 
written in a series of storing positions, where a plurality of 
old parities are stored, in a predetermined access order 
independent of the stored positions of corresponding old 
parities. At least to a plurality of storing positions in a track, 
these new parities are written in the order of positions in a 
track. To the storing positions which belong to a different 
track or to a different cylinder, new parities are written in the 
order of tracks or cylinders. 
FIG. 12 
DISK ARRAY DEVICE AND METHOD OF 
UPDATING ERROR CORRECTION CODES 
BY COLLECTIVELY WRITING NEW ERROR 
CORRECTION CODE AT SEQUENTIALLY 
ACCESSIBLE LOCATIONS 

BACKGROUND OF THE INVENTION 

The present invention relates to a method of updating 
error correcting codes and a disk array device which is 
suitable to the updating method, the disk array device 
comprising a plurality of disks for holding data and a disk 
for holding error correcting codes for the data held in the 
plurality of disks. 

In a present day computer system, the data needed by a 
higher order device, such as a CPU, etc., is stored in a 
secondary storage, and the CPU, as occasion demands, 
performs read or write operations for the secondary storage. 
In general, a nonvolatile storage medium is used for the 
secondary storage, and as a representative storage, a disk 
device (hereinafter referred to as a disk drive) or a disk in 
which a magnetic material or a photo-electromagnetic mate- 
rial is used can be cited. 

In recent years, with the advancement of computerization, 
a secondary storage of higher performance has been 
required. As a solution, there has been proposed a disk array 
constituted with a plurality of drives of comparatively small 
capacity. For example, reference is made to "A Case for 
Redundant Arrays of Inexpensive Disks (RAID)" by D. 
Patterson, G. Gibson, and R. H. Katz, read at the ACM SIGMOD 
Conference, Chicago, Ill., (June, 1988) pp. 109 to 116. 

In a disk array which is composed of a large number of 
drives, due to the large number of parts, the probability of 
occurrence of a failure becomes high. Therefore, for the 
purpose of increasing the reliability, the use of an error 
correcting code has been proposed. In other words, an error 
correcting code is generated from a group of data stored in 
a plurality of data disks, and the error correcting code is 
written to a different disk than the plurality of data disks. 
When a failure occurs in any of the data disks, the data in the 
failed disk is reconstructed using the data stored in the rest 
of the normal disks and the error correcting code. A group 
of data to be used for generating the error correcting code is 
called an error correcting data group. Parity is mostly used 
as the scheme for an error correcting code. In the following, 
parity is used for an error correcting code, except for a case 
under special circumstances; however, it will be apparent 
that the present invention is effective in a case where an error 
correcting code other than one based on parity is used. When 
parity is used for an error correcting code, the error correct- 
ing code data group can be also referred to as a parity group. 

The above-mentioned document reports the results of a 
study concerning the performance and the reliability of a 
disk array (level 3) in which write data is divided and written 
to a plurality of drives in parallel, and of a disk array (level 
4 and 5) in which data is distributed and handled indepen- 
dently. 

In a present day large scale general purpose computer 
system or the like, in a secondary storage, constituted with 
disk drives, the addresses of individual units of data which 
are transferred from a CPU are fixed to predetermined 
addresses, and when the CPU performs reading or writing of 
the data, the CPU accesses the fixed addresses. The same 
thing can be said about a disk array. 

In the case of a disk array (level 3) in which data is 
divided and processed in parallel, the fixing of addresses 
exerts no influence upon the disk array; however, in the case 
of a disk array (level 4 and 5) in which data is distributed and 
handled independently, when the addresses are fixed, a data 
writing process is accompanied by a large overhead. About the 
overhead, an explanation has been given in Japanese Patent 
Application No. Hei 4-230512; in the following also, the 
overhead will be explained briefly in the case of a disk array 
(level 4) in which data is distributed and handled indepen- 
dently. 

In FIG. 15A, each address (i, j) is an address for a unit of 
data which can be processed in read/write operation of one 
access time. 

Parity is constituted by a combination of data composed 
of 4 groups of data in each address (2,2) in the drives from 
No. 1 to No. 4, and the parity is stored in a corresponding 
address (2,2) in the drive No. 5 for storing parity. For 
example, when the data in the address (2,2) in drive No. 3 
is to be updated, at first, the old data before the update in the 
address (2,2) in drive No. 3 and the old parity in the address 
(2,2) in the parity drive No. 5 are read (1). 

An exclusive-OR operation on the read old data, the old 
parity, and the updated new data is carried out to generate a 
new parity (2). After the generation of the new parity is 
completed, the updated new data is stored in the address 
(2,2) in the drive No. 3, and the new parity is stored in the 
address (2,2) in the drive No. 5 (3). 
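
The three steps above (read old data and old parity, exclusive-OR them with the new data, store the results) can be illustrated with a minimal sketch. The function names and the sample byte values are hypothetical and chosen only for this example; the sketch assumes parity is the bytewise exclusive-OR of the data in a parity group, as in the text.

```python
from functools import reduce


def parity_of(blocks):
    """Parity of a parity group: bytewise XOR over all data blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def new_parity(old_data, old_parity, new_data):
    """Step (2): XOR of the old data, the old parity and the new data."""
    if not (len(old_data) == len(old_parity) == len(new_data)):
        raise ValueError("blocks must have equal length")
    return bytes(d ^ p ^ n for d, p, n in zip(old_data, old_parity, new_data))


if __name__ == "__main__":
    # Data at one address in drives No. 1 to No. 4 (made-up values).
    drives = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0", b"\xaa\x55"]
    parity = parity_of(drives)               # held in drive No. 5

    update = b"\x33\x44"                     # new data for drive No. 3
    updated_parity = new_parity(drives[2], parity, update)   # steps (1) and (2)
    drives[2] = update                       # step (3): store the new data

    # The shortcut gives the same result as recomputing over all four drives.
    print(updated_parity == parity_of(drives))
```

Only the drive being updated and the parity drive are read, which is why the update cost is dominated by the rotational latency on those two drives rather than by the size of the parity group.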

In the case of a disk array of level 5, in order to read out 
the old data and the old parity from the drive on which data 
is stored and from the drive on which the parities are stored, 
disk rotation is delayed by 1/2 turn on the average, and from 
the disk the old data and the old parity are read out to 
generate a new parity. 

As shown in FIG. 15B, one turn is needed to write the 
newly generated parity at the address (2,2) in the drive No. 
5. A latency time also is needed to write the new data at the 
address (2,2) in the drive No. 3. In conclusion, for the 
rewriting of data, at a minimum, a latency time of 1.5 turns 
is needed. In the case of the RAID 4, since a plurality of 
parities for the data in a plurality of parity groups are stored 
on the same disk, a latency time of one turn needed when a 
new parity is written causes a degrading of the performance 
in writing. Even if the write time of new data increases, the 
data access for the data stored on other disks can be 
performed independently, in principle, so that the influence 
of the overhead on the write time of data is smaller in 
comparison with the overhead involving an update of parity. 

In order to reduce the overhead during write time as 
described above, a dynamic address translation method may 
be employed, as disclosed in PCT International Application 
laid open under WO 91/20076, applied by Storage Technol- 
ogy Corporation (hereinafter referred to as STK). 

In Japanese Patent Application No. Hei 4-230512, applied 
for by IBM, there is disclosed a method for reducing the 
write overhead by writing data at an address other than the 
address at which the write data is to be written. 

On the other hand, in recent years, a flash memory has 
been suggested as a replacement for the magnetic disk. Since 
a flash memory is a nonvolatile memory, the reading or 
writing of data in the flash memory can be performed faster 
in comparison with that in a magnetic disk. In the case of a 
flash memory, however, when data is to be written, other 
data existing at the receiving address has to be erased. In the 
case of a representative flash memory, the write time or the 
read time is in the order of 100 ns, similar to the case of the 
RAM, but it takes about 10 ms for an erase time. Also, there 
is a limit to the number of times writing may be carried out, 
and the limit is said to be about one million times, which is 
regarded as a problem in practical use. 

In order to solve the above-mentioned problem concern- 
ing the limit in the number of times writing is possible in the 
case of a flash memory, a method in which address trans- 
lation is performed during the write time so that the number 
of writing times to flash memories can be averaged with the 
use of a mapping table is disclosed by IBM in Japanese 
Patent Application No. Hei 5-27924. 

SUMMARY OF THE INVENTION 






In the methods described in the above-referenced patent 
applications by STK and IBM, the overhead at the time of 
writing data and new parities can be reduced by use of a 
dynamic address translation method; however, in order to 
realize such a technique, the processes for the management 
of space regions (i.e. regions currently not in use) and used 
regions on the disk must be increased. Therefore, it is 
desirable to decrease the write overhead using a simpler 
method. 

An object of the present invention is to offer an error 
correcting code updating method in which the overhead for 
a writing process in a disk array can be decreased with a 
simple method. 

Another object of the present invention is to decrease the 
overhead for a writing process of an error correcting code by 
storing the error correcting code in a flash memory in a disk 
array. 

In order to achieve these objects, in a desirable mode of 
operation of a disk array according to the present invention, 
the following operations are performed: a new correcting 
code is generated for respective requests for writing issued 
by a host device, the new error correcting codes are tem- 
porarily held in a random access memory, and the new error 
correcting codes are written in a proper order to a disk being 
used for correcting codes. In this case, individual new error 
correcting codes are written at the positions in which a 
plurality of old error correcting codes were held in the 
positional access order. The access order of a plurality of 
storing positions in a track is determined according to the 
order of the positions in the direction of rotation. To the 
storing positions in the different tracks or cylinders new 
correcting codes are written in a proper order, for example, 
in the order of the tracks or in the order of the cylinders. 

Thereby, it is possible to write a plurality of new error 
correcting codes in a short time. 

In another desirable mode of operation of a disk array 
according to the present invention, the following operations 
are performed: before the groups of new error correcting 
codes arranged in a sequential order are written to a drive for 
error correcting codes, a plurality of effective error correct- 
ing codes which do not need updating, because no write 
request is issued for them, are read from the error correcting 
disk in order, the read error correcting codes which do not 
need updating are held in a flash memory together with the 
above-mentioned new error correcting codes, and the series 
of error correcting codes are written onto the disk for error 
correcting codes. 

In a further desirable mode of operation according to the 
present invention, a flash memory which has a short access 
time is used in place of a disk for storing error correcting 
codes, and a series of new error correcting codes are written 
to the flash memory according to one of the collective- 
writing procedures of error correcting codes as described 
above. 



BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a schematic diagram showing the hardware 
constitution of the embodiment 1. 

FIG. 2 is a schematic block diagram showing the internal 
constitution of the channel path director (5) and the cluster 
(13) shown in FIG. 1. 

FIG. 3 is a chart showing the details of the address 
translation table (40) to be used for the device shown in FIG. 
1. 

FIG. 4 is a flowchart of a data writing process and a parity 
updating process in the embodiment 1. 

FIG. 5 is a conceptual illustration of the
data writing process and the parity updating process in the
embodiment 1.

FIG. 6 is a timing chart of the writing process in the 
embodiment 1. 

FIG. 7 is an illustrative diagram for explaining a sequen- 
tial writing method of new parities in the embodiment 1. 

FIG. 8 is an illustrative diagram for explaining a sequen- 
tial writing method of new parities in the embodiment 2. 

FIG. 9 is an illustrative diagram for explaining region 
division in the parity drive in the embodiment 3. 

FIG. 10 is an illustrative diagram showing the regions for 
parities distributed in a plurality of data drives in the 
embodiment 4. 

FIG. 11 is a schematic block diagram for explaining the 
constitution of a device and the writing process in it in the 
embodiment 5. 

FIG. 12 is a schematic block diagram showing the hard- 
ware constitution in the embodiment 6. 

FIG. 13 is an illustrative diagram for explaining the 
address in the flash memory used in the embodiment 8. 

FIG. 14 is a flowchart for judging the number of writing 
times in the flash memory used in the embodiment 8. 

FIG. 15A is an illustrative diagram for explaining the 
procedure of a data updating process in a conventional level 
4 disk array. 

FIG. 15B is a time chart of a data updating process in a
conventional level 4 disk array. 

DESCRIPTION OF EMBODIMENTS 
(Embodiment 1)

FIG. 1 shows a first embodiment of a disk array system 
according to the present invention. The constitution itself of 
the device is known to the public, and the present invention 
is characterized in the procedure for the collective storage of 
a plurality of new parities generated for a plurality of data 
writing requests in a drive for parities, so that the constitu- 
tion of the device will be explained briefly to the extent 
necessary for understanding the characteristic features of the 
invention. 

A disk array system is composed of a disk array controller 
2 and a disk array unit 3 which are connected to a host 
device, CPU 1, by way of a path 4. The disk array unit 3 is 
composed of a plurality of logical groups 10. As explained 
later, the disk array system in the present embodiment is 
arranged to be able to access four drives in parallel. Fol- 
lowing the above, each logical group 10 in the present 
embodiment is composed of four drives 12; however, in 
general, the logical group 10 may be composed of m drives 
(m: an integer larger than 2). These drives are connected to 
four disk array unit paths 9 by a drive switch 100. There is
no special limit in the number of the drives 12 which
constitute each logical group. It is assumed in the present
invention that each logical group 10 is a failure reconstruction
unit, and a plurality of drives 12 in a logical group 10
are composed of three drives for data and a drive for holding
parities generated from the data in the drives.

The disk array controller 2 comprises a channel path
director 5, two clusters 13, and a cache memory 7. The cache
memory 7 is constituted by semiconductor memories which
are made to be nonvolatile by a battery backup, etc. The
cache memory 7 stores data to be written to any
one of the drives, data read from any one of the drives,
and an address translation table 40 which is used for the
access to each of the drives in the disk array. The cache
memory 7 and the address translation table 40 are used in
common by all clusters 13 in the disk array controller 2.

The cluster 13 provides a gathering of paths which can be
operated independently from each other in the disk array
controller 2, and each cluster has its own power supply and
circuit, which are independent from those of other clusters.

The cluster 13 is constituted with two channel paths 6 and
two drive paths 8; there are four channel paths between the
channel path director and the cache memory 7, and there are
also four drive paths between the cache memory 7 and drives
12. Two channel paths 6 and two drive paths 8 are connected
through the cache memory 7.

FIG. 2 shows the internal constitution of the channel path 
director 5 and a cluster 13. 

As shown in FIG. 2, the channel path director 5 is
constituted with a plurality of interface adapters 15 which
receive the commands from the CPU 1 and a channel path
switch 16.

Each of the two channel paths 6 in each cluster 13 is
composed of a circuit for transferring commands and a
circuit for transferring data. The former is composed of a
microprocessor (MP) 20 for processing the commands from
the CPU 1, a channel director 5, and a path 17 for transferring
commands being connected to the microprocessor 20,
etc. The latter is composed of a channel interface circuit (CH
IF) 21, which is connected to the channel path switch 16 by
line 18, and which takes charge of the interface with the
channel director 5, a data control circuit (DCC) 22, which
controls the transfer of data, and a cache adapter circuit (C
Adp) 23. The cache adapter circuit 23 performs reading of
data or writing of data for the cache memory 7 under
command of the microprocessor 20, and also it performs
monitoring of the condition of the cache memory 7 and
exclusive control for the write requests and read requests.

Each of the two drive paths 8 in each cluster 13 is 
composed of a cache adapter circuit 14 and a drive interface 
circuit 24. The latter is connected to each logic group 10 
through one of the four disk array unit paths 9. 

Further in each cluster 13, there are contained two parity
generator circuits 25, which are connected to the cache
memory 7.

(Address translation table 40) 

In the present embodiment, it is assumed that the path 9
is constituted as a SCSI path.

Further, in the present embodiment, a plurality of the
drives 12 are used as data drives, except for one drive, in the
logic groups 10, and the one drive is exclusively used for a
specified parity, such as RAID 4.

Each drive comprises a plurality of disks and a plurality 
of heads which are provided corresponding to the disks, and 



tracks which belong to different disks are divided into a 
plurality of cylinders, and a plurality of tracks which belong 
to a cylinder are disposed to be accessible in order. 

The CPU 1, by means of a read command or a write 
command, designates the data name as a logical address of 
the data which is to be dealt with by the command. In the 
present embodiment, the length of data designated with a 
write command from the CPU 1 is assumed to have a fixed 
length. The logical address is translated to an Addr in the
SCSI, which is a physical address in one of the actual drives 12,
by either of the microprocessors 20 in the disk array controller 2.

To be more specific, the Addr in the SCSI comprises: the
number of the drive in which the requested data is stored; a
cylinder address, which is the number of a cylinder in the
above-mentioned drive 12; a head address, which is the number
of the head which selects a track in the cylinder; and a record
address, which expresses the position of the data in the track.

As shown in the following, a table 40, hereinafter referred 
to as an address table (FIG. 3), is used for the address 
translation. The table 40 is stored in the cache memory 7. 

The address table 40 holds the address information for 
individual parity groups held by all logical groups in the disk 
array. In other words, the address table 40 holds the address 
information about a plurality of units of data which consti- 
tute each parity group and the address information about 
parities of the parity group as a set.

The address table 40 holds for each unit of data in one of 
the parity groups: a logical address 27 of the data, a 
nullifying flag 28 which is turned ON (1) when the data is 
invalid, the number of a data drive 29 (D Drive No.) in 
which the data is stored, the Addr 30 in the SCSI which 
expresses a physical address in the drive in which the data 
is actually stored, a cache address 31 which expresses the 
address in the cache memory 7 when the data is present in 
the cache memory 7, and a cache flag 32 which is turned ON 
(1) when the data is present in the cache memory 7. In the 
present embodiment, it is assumed that a plurality of units of 
data which belong to the same parity group are held in the 
Addr 30 in the same SCSI in each drive 12 which constitutes 
one of the logical groups 10. 

Further, the table 40 comprises for the parity group: a P 
logical address 33 which is a logical address of a parity of 
the parity group, the number of a drive (parity drive) 34 (P 
Drive No.) in which the parity is stored, an Addr 35 in the 
PSCSI which is a physical address in the parity drive in 
which the parity is actually stored, a P cache address 36 
which shows the stored position of the parity when the parity 
is stored in the cache memory 7, and a P cache flag 37 which 
shows the existence of the parity in the cache memory 7. 
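
The fields of the address table 40 listed above can be pictured as a small data structure. The following is only an illustrative sketch: the class names, the use of strings for logical addresses, and the tuple encoding of the Addr in the SCSI are assumptions made for the example and are not taken from FIG. 3.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed encoding of an Addr in the SCSI: (cylinder, head, record).
ScsiAddr = Tuple[int, int, int]


@dataclass
class DataEntry:
    logical_address: str                  # logical address 27
    drive_no: int                         # D Drive No. 29
    scsi_addr: ScsiAddr                   # Addr 30 in the SCSI
    invalid: bool = False                 # nullifying flag 28
    cache_address: Optional[int] = None   # cache address 31
    cache_flag: bool = False              # cache flag 32


@dataclass
class ParityGroup:
    data: List[DataEntry]
    p_logical_address: str                  # P logical address 33
    p_drive_no: int                         # P Drive No. 34
    p_scsi_addr: ScsiAddr                   # Addr 35 in the PSCSI
    p_cache_address: Optional[int] = None   # P cache address 36
    p_cache_flag: bool = False              # P cache flag 37


class AddressTable:
    """Hypothetical in-memory form of the address table 40."""

    def __init__(self, groups: List[ParityGroup]):
        self.groups = list(groups)
        # Index every unit of data by its logical address.
        self._index = {d.logical_address: (g, d)
                       for g in self.groups for d in g.data}

    def translate(self, logical_address: str):
        """Return ((data drive, Addr), (parity drive, PAddr)) for one logical address."""
        group, entry = self._index[logical_address]
        return ((entry.drive_no, entry.scsi_addr),
                (group.p_drive_no, group.p_scsi_addr))
```

A lookup by logical address then yields both the data location and the location of the parity of the same parity group, which is the information the microprocessor 20 needs before reading the old data and the old parity.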

When the power supply is turned ON, the address table 40 
is read automatically, with no sensing of the CPU 1, into the 
cache memory 7 from a specified drive in the logical group 
10 by either microprocessor 20. On the other hand, when the 
power supply is turned OFF, the address table 40 stored in 
the cache memory 7 is automatically stored back into a 
predetermined position in the specified drive. 

(Writing process) 

A writing process includes the following: the updating of 
data in which a user designates a logical address to which the 
data is to be written and rewrites the data, and new writing 
of data in a space region. In the present embodiment, for 
simplification, a writing process for the update of data will 
be explained with reference to FIG. 4 to FIG. 7. 

When a write request is issued from the CPU 1, the 
following operation is executed under the control of either of 
the microprocessors 20. 
A logical address and new data for writing, which are
transferred from the CPU 1, are stored in the cache memory
7 through a cache adapter circuit 23. The address in the
cache memory at which the data is stored is registered in a
cache address 31 in the address table 40 which corresponds
to the logical address of the data.

At this time, a cache flag 32 which corresponds to the 
logical address 27 is turned ON (1). 

When a further write request is issued from the CPU 1 for
the new data held in the cache memory 7, the new data is
rewritten.

Either of the microprocessors 20, which receives the write
request, recognizes drives 12 (designated by the data drive
No. field 29 and the parity drive No. field 34 of the address
translation table 40 shown in FIG. 3) in which the data and
parities are stored, and Addr 30 in the SCSI and Addr 35 in
the PSCSI, the physical address of the drive 12, from the
logical address designated by the CPU 1, by referring to the
address table 40.

As shown in FIG. 5, when a write request is issued from
the CPU 1 for the Data No. 1 in the drive 12 of SD No. 1
to update the data to a New Data No. 1, the microprocessor
20, after recognizing the physical addresses of Data No. 1,
the data to be updated (old data), and of parity No. 1, the
parity to be updated (old parity), by referring to the address
table 40, reads out the old data and the old parity from
respective drives (step (1) in FIGS. 4 and 5). The read out
old data and old parity are stored in the cache memory 7.

An exclusive-OR operation is performed on the old data
and the old parity, and new data to be written, to generate a
New Parity No. 1, a new parity after updating, and the new
parity is stored in the cache memory 7 (step (2) in FIGS. 4
and 5).
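
The exclusive-OR of step (2) can be illustrated with a short sketch. The function names and the use of byte strings as blocks are assumptions for the example; in the embodiment the operation is performed by hardware (the parity generator circuit 25), not by software of this form.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise exclusive-OR of equally sized blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def new_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """New parity = old data XOR old parity XOR new data."""
    return xor_blocks(old_data, old_parity, new_data)
```

Because XOR-ing the old data into the old parity cancels its contribution, the result equals the parity that would be obtained by recomputing over all data of the parity group with the new data in place, without reading the other data drives.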

After the completion of storing of the new parity (New
Parity No. 1) into the cache memory 7, the microprocessor
20 writes new data (New Data No. 1) in the address of Data
No. 1 in the drive 12 in SD No. 1 (step (3) in FIGS. 4 and
5). The writing of the new data can be performed asynchronously
under the control of the microprocessor 20.

The characteristic of the present invention is that a new
parity (New Parity No. 1) is not immediately written in the
parity drive SD No. 5, but is put together with other new
parities corresponding to other write requests and they are
written en bloc in the parity drive.

In the case where New Data No. 1 is registered in an entry
having a logical address of Data No. 1 for the address table
40 and the data is held in the cache memory 7, the address
in the cache memory 7 is registered in the cache address 31,
and the cache flag 32 is turned ON. As for the parity, the cache
address is registered in the Pcache address 36, and the
Pcache flag 37 is turned ON.

As described above, in the address table 40, a parity
whose Pcache flag is in the ON state is classified to be an
updated parity, and a parity stored in a drive for storing
parity is regarded to be invalid thereafter.

In the present embodiment, as shown in FIG. 6, when the
new data and the new parity from the CPU 1 are stored in
the cache memory 7, the microprocessor 20 reports to the
CPU 1 that the process of writing the data is finished.

In a conventional method, after a new parity is written in 
a parity drive with a delay of one turn, the microprocessor 
20 reports to the CPU 1 that the writing process is finished. 

The writing of a new parity into the parity drive No. 5 is
performed asynchronously under the control of the microprocessor
20, so that it cannot be seen by a user.




When the writing of new data into the data drive No. 1
is performed asynchronously under the control of the microprocessor
20, the completion of the writing is not seen by a
user.

Following the above, when a write request for other data, 
for example, Data No. 10 and Data No. 8 stored in data 
drives SD No. 2 and SD No. 4 is issued, these requests are 
processed in the same way as described above, and new data 
corresponding to the above-mentioned data is written in a 
data drive and new parities corresponding to them are stored 
in the cache memory 7. 

When the number of new parities collected in the cache
memory 7 exceeds a preset value set by a user, or when there
is a period of time in which no read/write request is issued
from the CPU 1, these new parities are written collectively
in order into a parity drive SD No. 5 using a method to be
explained later (step (4) in FIGS. 4 and 5).

After such a writing process is finished, the Addr in the SCSI
at which the new parities are actually written are registered in
the Addr 35 in the PSCSI in the address table 40.
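
The two conditions that trigger the collective writing (the count of pending new parities exceeding a preset value, or an idle period with no request from the CPU 1) can be sketched as follows. The class and method names are hypothetical, and the choice to keep a parity at its original place in the order when it is regenerated by a later write is an assumption of this sketch, not something the description specifies.

```python
class PendingParities:
    """Hypothetical holder for new parities awaiting collective writing."""

    def __init__(self, threshold: int):
        self.threshold = threshold   # preset value set by a user
        self._pending = {}           # parity logical address -> newest parity block

    def add(self, p_logical_address: str, parity: bytes) -> None:
        # A dict keeps first-insertion order, so a regenerated parity
        # replaces the cached block without moving in the order (assumption).
        self._pending[p_logical_address] = parity

    def should_flush(self, host_idle: bool) -> bool:
        """True when the count exceeds the threshold, or the host is idle
        and at least one new parity is waiting."""
        if len(self._pending) > self.threshold:
            return True
        return host_idle and bool(self._pending)

    def drain(self):
        """Return the pending parities in order of occurrence and clear them."""
        batch = list(self._pending.items())
        self._pending.clear()
        return batch
```

After a drain, the caller would write the batch sequentially and then register the addresses actually used in the Addr 35 in the PSCSI.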

(Collective writing of a plurality of new parities) 

In the present embodiment, a plurality of new parities held 
in the cache memory 7 are written into a parity drive in a 
sequential order. The positions in the parity drive into which 
individual new parities are written are different from the case 
of a conventional technique, wherein new parities are writ- 
ten in the positions where the old parities corresponding to 
new parities were held. In this regard, in the present embodi- 
ment, new parities are written into the positions determined 
by the order of access among a plurality of storage positions 
in which a plurality of old parities were held. 

The order of access is determined such that a plurality of 
storing positions which belong to a track are accessed in the 
order of the storage positions in the direction of rotation. 

In the present embodiment, a plurality of new parities held 
in the cache memory are written into a parity drive in the 
order of occurrence of the write requests which have caused 
the new parities to be generated. 

To be more specific, a detailed explanation will be given 
in the following. 

(1) The plurality of tracks to which the old parities
corresponding to the new parities belong are found, and the
group of new parities held in the cache memory 7 is written in
order into the positions of the old parities according to the
predetermined cylinder order and track order. In this case, new
parities are written into the positions of the plurality of old
parities in a track in the order of the storage positions
in the track.

For example, in FIG. 7, there is shown a state where a 
group of old parities in the track 50 (No. 1) and track 51 (No. 
2) in the cylinder No. 1, and in the track 52 (No. 1) and track 
53 (No. 2) in the cylinder No. 2 are rewritten to a group of 
new parities in the cache memory 7. In this case, it is 
assumed that a new parity group, 



P'15, P'20, P'3, P'14, P'31, P'32,
P'10, P'26, P'1, P'16, P'6, P'13, P'19,
P'23, P'17, P'9, P'22, P'23, P'24,
P'25, P'26, P'30, P'31, P'32,



is generated in this order, and a group of old parities 
corresponding to the new parities belong to tracks No. 1 and 
No. 2 in each of the cylinders No. 1 and No. 2. 

These new parities are written in order from the top into 
the position of an old parity in the track No. 1 in the cylinder 







No. 1, and then the new parities are written into the positions
of old parities in the order of storage positions in the track.
In this case, new parities P'15, P'20, P'3, P'14, P'31, and
P'32 are written in the positions of old parities P1, P2, P3,
P6, P7, and P8. It is possible to finish the writing process
described above in a single turn of the parity drive.

(2) A part of the rest of the new parities are written in a 
predetermined next track in the same cylinder No. 1, in this 
case, they are written in the track No. 2. 

In the example shown in FIG. 7, new parities P'10, P'26,
P'1, P'16, P'6, P'13, and P'19 are written in the track No. 2
in a sequential order.

(3) A part of the rest of the new parities are written in the
next cylinder, in this example, in a track in the No. 2
cylinder, in this case, at a position in the track No. 1.

In the example shown in FIG. 7, new parities P'23, P'17,
P'9, P'22, P'23, and P'24 are written in the track No. 1 in
order.

(4) A part of the rest of the new parities are written in the
same way into the next track in the same cylinder, in this
case, in the track No. 2.

In the example shown in FIG. 7, new parities P'25, P'26,
P'30, P'31, and P'32 are written in the track No. 2 in order.

In the way described above, a group of new parities can
be written in a short search time into a group of storing
positions to which a group of old parities belong.

In a conventional method, the Addr 35 in the PSCSI for
storing parities in a parity drive is made to be identical with
the Addr 30 in the SCSI for storing data in the drive 12 for
storing data. In the present embodiment, however, a plurality
of new parities are written in different positions from the
storing positions of the corresponding old parities, that is,
they are written in the storing positions of the old parities in
the access order of the positions; thereby a plurality of new
parities can be written in a short time.
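
The placement rule of steps (1) to (4) can be sketched as a simple pairing: the storing positions freed by the old parities are sorted into access order (cylinder, then track, then position in the direction of rotation), and the new parities are assigned to them in the order in which they were generated. The function name and the tuple form of a storing position are assumptions made for this sketch, and the positions in the usage note are invented for illustration rather than read from FIG. 7.

```python
from typing import List, Tuple

# Assumed form of a storing position: (cylinder, track, rotational position).
Slot = Tuple[int, int, int]


def assign_slots(new_parities: List[str],
                 old_parity_slots: List[Slot]) -> List[Tuple[str, Slot]]:
    """Pair new parities, in generation order, with the old-parity
    positions taken in access order."""
    if len(new_parities) != len(old_parity_slots):
        raise ValueError("one storing position is needed per new parity")
    # Tuples sort by cylinder first, then track, then rotational position.
    access_order = sorted(old_parity_slots)
    return list(zip(new_parities, access_order))
```

For instance, `assign_slots(["P'15", "P'20", "P'3"], [(1, 2, 4), (1, 1, 3), (1, 1, 15)])` places P'15 and P'20 in track No. 1 of cylinder No. 1 and P'3 in track No. 2, regardless of where the corresponding old parities P15, P20 and P3 were stored.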

As described above, when parities are written sequentially,
a sufficient sector gap must be available so that, after a
parity has been written, the block needed for writing the next
parity is not passed by while the process to start the writing
of the next parity is performed (on this point, refer to, for
example, 'Transistor Technology, Special No. 27', CQ Publishing
Co., May 1, 1991, p. 20).

In a case where a new write request is issued from the
CPU 1 while new parities collected in the cache memory 7
are being written sequentially, collectively, the above-mentioned
writing process is continued and after the writing
process is finished, the new write request is processed.

(Reading process) 

The reading process is basically the same as the method
known to the public. When a read request issued from the
CPU 1 is processed, similar to the case where a write request
is processed, a logical address designated by the request is
translated to a physical address and the desired data is read
from one of the data drives, and the read data is transferred
to the CPU 1. However, when the data is held in the
cache memory 7, the data is transferred from the cache
memory 7 to the CPU 1.
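
The read path just described (serve from the cache memory 7 when the cache flag is ON, otherwise translate the logical address and read from the data drive) can be sketched as below. The dictionary layout of the table entry, the cache and the drives is an assumption made purely for illustration.

```python
def read_data(logical_address, address_table, cache, drives):
    """Return the data for a logical address, preferring the cache.

    address_table: logical address -> dict with keys
        'cache_flag', 'cache_address', 'drive_no', 'scsi_addr' (assumed layout)
    cache:  cache address -> data
    drives: drive number -> {Addr in the SCSI -> data}
    """
    entry = address_table[logical_address]
    if entry["cache_flag"]:
        # The data is present in the cache memory; no drive access is needed.
        return cache[entry["cache_address"]]
    # Otherwise read from the physical address in the data drive.
    return drives[entry["drive_no"]][entry["scsi_addr"]]
```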

(Failure reconstruction process) 

A method for reconstructing the data in a failed drive when
a failure occurs in one of the drives will now be explained.

For example, as shown in FIG. 5, it is assumed that a
Parity No. 1 in a drive 12 in SD No. 5 is composed of Data
No. 1 in a drive 12 in SD No. 1, Data No. 2 in a drive 12 in
SD No. 2, Data No. 3 in a drive 12 in SD No. 3, and Data
No. 4 in a drive 12 in SD No. 4. In the same way, it is
assumed that Parity No. 2 is composed of Data Nos. 5, 6, 7
and 8, and Parity No. 3 is composed of Data Nos. 9, 10, 11
and 12.

When a failure occurs at any one of the drives SD Nos. 1,
2, 3 and 4, for example, at SD No. 1, all the data in the failed
drive SD No. 1 can be reconstructed from the data in the rest
of the drives, SD No. 2, SD No. 3, and SD No. 4 and a parity
in the drive No. 5.

In accordance with the present invention, as a result of
the collective writing of the plurality of parities, in general,
the addresses in the drives SD No. 1 to SD No. 4 of a
plurality of units of data which belong to the same parity
group do not coincide with the addresses of parities in the
parity group in the parity drive SD No. 5.

Therefore, when the data in the failed drive SD No. 1 is
to be reconstructed, either of the microprocessors 20 reads
out all parities in the parity drive SD No. 5 into the cache
memory 7. In that case, the microprocessor 20 registers the
addresses in the cache memory 7 which store the read out
parities in the Pcache address 36 in the address table 40, and
turns the Pcache flag 37 ON.

Next, the microprocessor 20 reads out three units of 
normal data which belong to one of the parity groups in 
drives 12 in SD Nos. 2, 3 and 4, for example, Data No. 2, 
Data No. 3 and Data No. 4, in order and searches the cache 
address for a parity which belongs to the same parity group 
with these units of data from the address table 40. A parity 
designated by the located address, and the above-mentioned 
three units of normal data are sent to a parity generating 
circuit (PG) 25 (FIG. 1) and a unit of data in the failed drive 
SD No. 1 which belongs to the parity group, for example, 
Data No. 1 is reconstructed. In the same way, other units of 
data in the failed drive SD No. 1, for example, Data Nos. 5, 
9, etc. are reconstructed. 

After the failed drive SD No. 1 is replaced by a normal 
drive, the reconstructed data is written to the normal drive; 
thus, the reconstruction process is completed. 

In a case where a standby normal drive is provided
beforehand for the occurrence of a failure in the drives Nos.
1 to 5, the reconstructed data is written in the standby normal
drive.

In the present embodiment, data, except for parity, which 
belongs to one parity group is held at the same address in 
different data drives, but the address of a parity in a parity 
drive differs from that in the former. When a failure occurs 
in one of the data drives, the parity data which corresponds 
to the data in a different parity group read from respective 
data drives in order have to be read from a random position 
of a parity drive. In the present embodiment, in place of a 
random reading, it is arranged that all parities are read from 
parity drives and stored in the cache memory 7, and the 
parities which correspond to respective parity groups read 
from a plurality of data drives are utilized by retrieving them 
from the cache memory 7. Thereby, when a failure is to be 
reconstructed, the utilization of parities can be performed at 
a high speed. 

(Modification of Embodiment 1) 

In the above explanation, the cache memory 7 which 
stores updated new parities is assumed to be a nonvolatile 
semiconductor memory. However, a parity, different from 
data, can be reconstructed even if it is erased by an inter- 
ruption of electricity or other factors, so that if the overhead 
for regenerating parity is permitted, it is possible to consti- 
tute a region for storing old parities in the cache memory 7 
with a volatile semiconductor memory. 

In the above explanation, updated new parities are stored 
in the cache memory 7; however, it is also possible to 
provide an exclusively used memory for storing them. 
(Embodiment 2)

There is another method of collective writing of a plurality
of new parities in accordance with the present invention.
As shown in FIG. 8, effective parities P8 and P9, which
are not to be updated by a write request, are read into the cache
memory 7 in response to an instruction from the microprocessor
20, and according to the read parity, the microprocessor
20 sets the Pcache address 36 in the address table 40
and turns the Pcache flag 37 ON; thereby, the above-mentioned
effective parities are regarded to be new parities
to be updated, and they can be put together with the other
new parities for sequential, collective-writing.

For example, as shown in FIG. 8, a group of new parities,
as explained with reference to FIG. 7, are held in the cache
memory 7, and after that old parities P4, P5, P12, P18, etc.
for which there are no write requests and for which originally
there was no need for an update, are regarded to be a
parity group to be updated and they are held in the cache
memory 7. The parity group which is held in the cache
memory 7, as described above, is written in order into a track
to which the new parities belong. In the example shown in
FIG. 8, following the parity groups P4, P5, P12, and so on,
which did not need rewriting originally, there are generated
new parity groups, P'15 and so on, which are written in order
into the track No. 1 and track No. 2 in the cylinder No. 1 and
then into the track No. 1 and track No. 2 in the next cylinder
No. 2.

In the embodiment 2, there is a need for a surplus process
in that the parities which do not need updating originally
have to be read from a drive; however, as opposed to the
embodiment 1, the read out parities and the generated new
parities can be written collectively into respective tracks in
order, so that the write control becomes simpler than that in
the embodiment 1. The method shown in the present
embodiment, in the same way as the method in the embodiment
1, can be applied to any of the following embodiments.

(Embodiment 3) 


In the present embodiment, as shown in FIG. 9, a parity 
drive is divided into a plurality of regions, and the collec- 
tive-writing of parities as described in the above is per- 
formed by the units of regions. 

The dividing of a region is performed by the Addr 30 in
the SCSI.

For example, the Addr 30 in the SCSI designates a
plurality of cylinders which belong to a group of cylinders,
from one cylinder to another cylinder, to be a region D1.

In the drives 12 in the SD Nos. 1, 2, 3 and 4, the Addr 30
in the SCSI designates that the parity corresponding to the
parity groups which belong to these cylinders, that is, the
Addr 35 in the PSCSI of the drive 12 in the SD No. 5, is
stored in a region to which these cylinders belong.

As described above, in the address table 40, the region to
which the parity belongs and the region to which the data
belong are made to correspond to each other.

In a case where such dividing of a region is executed,
when the updating of a parity which belongs to the region
D1 is performed by a data write request, the new parity is
held in the cache memory 7 as a parity in the region D1.

In the same way, new parities generated by other write
requests from the CPU 1 are held in the cache memory 7,
and the new parities which are held as parities in the region
D1 are written into the region D1 sequentially, and collectively.
For the other regions D2, D3 and D4, in the same way as
described above, new parities which belong to respective 
regions are written sequentially, and collectively, into 
respective regions using the method described with refer- 
ence to the embodiments 1 or 2. 
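
The region-wise handling can be sketched as a bucketing step that precedes the collective writing: each pending new parity is filed under the region whose cylinder range contains the cylinder of its data. The function names, the list of region start cylinders, and the numeric boundaries in the usage note are assumptions for illustration; the description does not give concrete cylinder ranges for D1 to D4.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, Tuple


def region_of(cylinder: int, region_starts: List[int]) -> int:
    """Return the 1-based region number (1 for D1, 2 for D2, ...) whose
    cylinder range contains `cylinder`. `region_starts` is the sorted
    list of the first cylinder of each region."""
    index = bisect_right(region_starts, cylinder) - 1
    if index < 0:
        raise ValueError("cylinder lies before the first region")
    return index + 1


def bucket_by_region(pending: List[Tuple[str, int]],
                     region_starts: List[int]) -> Dict[int, List[str]]:
    """Group (new parity name, data cylinder) pairs by region, keeping
    the order of occurrence inside each region."""
    buckets = defaultdict(list)
    for name, cylinder in pending:
        buckets[region_of(cylinder, region_starts)].append(name)
    return dict(buckets)
```

With hypothetical boundaries `[0, 100, 200, 300]`, a new parity whose data lies in cylinder 150 is held as a parity in the region D2 and is later written sequentially into D2 only.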

In the present embodiment, when reconstruction of data 
for a failed drive is to be performed, the same process as that 
in the embodiment 1 can be performed in each parity group 
to which parities in each parity region belong. In other 
words, all parities in each parity region are read in the cache 
memory 7, and then the reconstruction of data in the parity 
group to which the parities belong can be performed. The 
same process will be performed in order for different 
regions. 

As a result, in the present embodiment, when a failure is
reconstructed, the total quantity of parities to be read in the
cache memory 7 is smaller than that in the embodiment 1.

(Embodiment 4) 

In the present embodiment, as shown in FIG. 10, the
function of a drive for the regions D1, D2, D3 and D4 to
which parities are written is not limited to the function of
one drive which is used exclusively for parities, but the
function can be distributed to a plurality of drives SD No. 1
to SD No. 4, which constitute a logical group 10. The other
points except the above are the same as in the case of the
embodiment 3.

(Embodiment 5) 

In the present embodiment, when a failure occurs in any 
of the parity drives, it is possible to expedite the reconstruc- 
tion of a parity in the failed parity drive utilizing the 
collective-writing described with reference to the embodi- 
ment 1. 

In other words, the collective-writing of parities described 
in the embodiment 1 is executed in parallel in a plurality of 
parity drives contained in different logical groups, and 
further, in parallel to the above operation, a different parity 
is generated from the parities in these parity drives and is 
written into the other parity drive, which is provided in 
common to the other parity drives. When a failure occurs in 
any of the plurality of parity drives, the parity in the failed 
parity drive is reconstructed using the new parity and normal 
parities in the parity drives. Heretofore, to reconstruct a 
failed parity drive, data has been read from all data drives in 
a logical group to which the parity drive belongs, and new 
parities have been generated from them. In the present 
embodiment, a failed parity drive can be reconstructed faster
than in the conventional case.

To be more specific, in the logical group No. 1 shown in
FIG. 9, updated new parities P'1, P'2, . . . , corresponding to
parities P1, P2, . . . , are held in the cache memory 7 in this
order using the method described with reference to the
embodiment 1. Similar to the above, in the logical group No.
2, updated new parities P'4, P'3, . . . , corresponding to
parities P3, P4, . . . , are held in this order in the cache
memory 7. Similar to the above, in the logical group No. 3,
updated new parities P'6, P'5, . . . , corresponding to parities
P6, P5, . . . , are held in the cache memory 7 in this order.
Similar to the above, in the logical group No. 4, updated new
parities P'7, P'8, . . . , corresponding to parities P7, P8,
. . . , are held in the cache memory 7 in this order. These new
parities are written in order to respective drives SD No. 6,
SD No. 12, SD No. 18 and SD No. 24 similar to the case of
the embodiment 1.

In the present embodiment, a new parity PP'1 is generated
from the new parities P'1, P'4, P'6 and P'7 in the logical
groups No. 1 to No. 4. Similar to the above, new parity PP'2
is generated from P'2, P'3, P'5 and P'8. These new parities
are stored in another drive 200 which is provided in common
to these logical groups. When a failure occurs in any of the
parity drives SD No. 6, SD No. 12, SD No. 18 or SD No. 24,
a parity in the failed drive is found from the rest of the
normal drives and a common drive 200; thus, parities in the
failed drive are regenerated.
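
The relation between the parities of the logical groups and the parity held in the common drive 200 may be illustrated by the following minimal sketch. It assumes that the common parity is the bytewise exclusive OR of the group parities, which the text does not state explicitly; the drive labels, the sample values and the function names are hypothetical.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Return the bytewise XOR of equal-length blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    if any(len(b) != len(blocks[0]) for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical new parities P'1, P'4, P'6 and P'7 held in the four parity drives.
group_parities = {
    "SD6": bytes([0x12, 0x34]),
    "SD12": bytes([0xAB, 0xCD]),
    "SD18": bytes([0x0F, 0xF0]),
    "SD24": bytes([0x55, 0xAA]),
}

# PP'1, stored in the common drive 200 (XOR is an assumption of this sketch).
common_parity = xor_blocks(*group_parities.values())


def rebuild_parity(failed: str, parities: dict, common: bytes) -> bytes:
    """Regenerate the parity of the failed drive from the surviving parity
    drives and the common drive, without reading any data drive."""
    survivors = [p for name, p in parities.items() if name != failed]
    return xor_blocks(common, *survivors)


rebuilt = rebuild_parity("SD12", group_parities, common_parity)
```

Under this assumption, regenerating one parity requires reading only the three surviving parity drives and the common drive, instead of every data drive in the logical group to which the failed parity drive belongs.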

(Embodiment 6) 

FIG. 12 shows the hardware constitution of the present
embodiment; the same symbols or reference numerals as
those in FIG. 1 designate the same elements. The differences
in the device shown in FIG. 12 from the device shown in
FIG. 1 are: in place of a drive for holding parities, as shown
in FIG. 1, a flash memory (FMEM) 41 is provided as a
memory for holding parities in logical groups 10. The flash
memory 41 is composed of a flash memory controller
(FMEMC) 42 and a plurality of flash memory chips 40. In
addition to this, in the present embodiment, in place of a
parity drive No., P Drive No., and the Addr in the address
P SCSI in a drive, a flash memory No. and an Addr in the
flash memory are used.

Only the different points between the present embodiment 
and the embodiment 1 will be explained. 

In the present embodiment, the process of writing or 
reading data is the same as that in the embodiment 1. The 
collective-writing of parities in this embodiment differs from 
that in the embodiment 1 in that an erasing of the flash 
memory 41 is employed. 

Before new parities are written collectively and sequentially
into the flash memory 41, either of the microprocessors 20
examines the quantity of the new parities to be written, and
erases at one time the addresses in the flash memory chips,
corresponding to the examined quantity, in which
invalid parities (P1, P3 and P2) are written. After that,
similar to the embodiment 1, new parities are written
collectively and sequentially.

In the case of a flash memory, when new data is written
into the flash memory, old data stored at the address
at which the new data is to be written is first erased, and after the
erase process is completed, the new data is actually written.
In the case of a flash memory, it takes the same period of
time to erase one sector as to erase a plurality of sectors at one
time. (When a flash memory is accessed, an address of the same
format as the address used when a disk is accessed is used.)
A greater part of a write time is occupied by an erase
time, and an actual write time to a flash memory is negligibly
small in comparison with the erase time. Owing to the
sequential, collective writing, a characteristic of the present
embodiment, an erase process can be completed at one time,
so that the greater the number of collected new parities is,
the smaller the overhead can be made.
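
The effect on overhead can be illustrated by a simple cost model. The sketch below assumes, as stated above, that one erase operation takes the same time whether it covers one sector or several; the function names and the timing figures are hypothetical and are not taken from any particular flash memory chip.

```python
def individual_write_time(n_parities: int, erase_us: int, write_us: int) -> int:
    """Total time when each new parity is erased and written separately."""
    return n_parities * (erase_us + write_us)


def collective_write_time(n_parities: int, erase_us: int, write_us: int) -> int:
    """Total time when the storage locations of all collected parities are
    erased in one operation and the parities are then written sequentially."""
    return erase_us + n_parities * write_us


# Hypothetical figures: 10 ms per erase operation, 0.1 ms per sector write.
ERASE_US = 10_000
WRITE_US = 100

separate = individual_write_time(32, ERASE_US, WRITE_US)    # 32 erase operations
collected = collective_write_time(32, ERASE_US, WRITE_US)   # 1 erase operation
```

With these figures, 32 separately written parities cost 323,200 microseconds against 13,200 microseconds for the collective write, and the erase cost per parity falls as more parities are collected.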

(Embodiment 7)

In a device described in the embodiment 6, it is also
possible to execute collective-writing of parities by the
method shown in the embodiment 2.

In this case, similar to the embodiment 2, parities other
than the parities to be updated are also read out from the
flash memory 41, and together with the generated new
parities they are written collectively to the memory. In the
present embodiment, as opposed to the embodiment 2, in a
period of time between the read out and the collective
writing, an erase operation is executed at one time for the
positions of old parities to be updated, which are stored in
the flash memory, and for the positions in which the
above-mentioned read out parities are held.



(Embodiment 8) 

A flash memory, in comparison with a disk drive, has the 
merit that even if it is accessed at random, the access can be 
processed in a short time; on the other hand, the number of 
writing times has a limit. Therefore, if writing is concen- 
trated to a single address, a part of the flash memory may 
reach the limit of the number of times it is capable of being 
written. After that, writing to the flash memory may become 
impossible. 

In the present embodiment, in order to reduce the prob- 
lem, when new parities are sequentially, collectively written 
to a plurality of flash memory chips 40 which constitute the 
flash memory 41, a chip to be written is changed in a regular 
order to average the number of times of writing to all flash 
memory chips 40. 

To be specific, as shown in FIG. 13, it is assumed that the
flash memory 41 is composed of n flash memory
chips, Nos. 1, 2, 3, 4, . . . , n, and addresses are from 0000
through ffff. When either of the microprocessors 20 recognizes
that the sequential, collective-writing of new parities is
to be performed, it examines the last address among the
addresses in which new parities were stored when the
collective-writing was performed the preceding time.

For example, assuming that in the preceding collective-writing
new parities were written to the addresses
from 0000 through aaaa in the flash memory 41, the
microprocessor 20 stores the address aaaa. When the next
sequential, collective-writing is to be performed, the
microprocessor 20 examines the stored address (aaaa). Because
the microprocessor stores the last address used in
the preceding writing, it judges that the next sequential
collective-writing of new parities is to be started from the
address next to the address aaaa.

When the microprocessor 20 recognizes the head address
of the sequential collective-writing of new parities as
described above, in the next step, it judges whether the number of
times of writing to the flash memory 41 has reached the limit.

As shown in the judgment flowchart for the number of
writing times in FIG. 14, the microprocessor 20 determines
whether the head address of a sequential, collective-writing
of new parities is 0000 (51). If it is not 0000, the judgment
flow is finished (52), and if it is 0000, the microprocessor 20
adds 1 to the counter value of the number of writing times
counter (53). In other words, every time the head address
returns to 0000, the counter value is increased.

Next, the microprocessor 20 judges if the counter value
plus 1 corresponds to the preset limiting value of the number
of writing times for a flash memory chip 40 (54). The
limiting value of the number of times of writing for a flash
memory chip 40 is set by a user for the microprocessor 20
during initialization.

In the judgment result, when the number of writing times
to the flash memory chip 40 does not exceed the limiting
value, the judgment flow is finished (52), and when the
number of writing times exceeds the limiting value, the
microprocessor 20 indicates the need for flash memory chip
replacement (55).
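
The selection of the head address and the judgment flow of FIG. 14 may be sketched as follows. The function names are hypothetical, the return of the head address to 0000 after ffff is an assumption implied by the counting step, and the comparison in step 54 is simplified to "the counter has reached the limiting value", which is only one possible reading of the description.

```python
from typing import Optional, Tuple

FIRST_ADDR = 0x0000
LAST_ADDR = 0xFFFF


def next_head_address(last_addr: Optional[int]) -> int:
    """Head address of the next collective write: the address following the
    last address used by the preceding write (assumed to wrap to 0000)."""
    if last_addr is None or last_addr >= LAST_ADDR:
        return FIRST_ADDR
    return last_addr + 1


def judge_write_count(head_addr: int, counter: int, limit: int) -> Tuple[int, bool]:
    """Return the updated counter and whether chip replacement is indicated."""
    if head_addr != FIRST_ADDR:        # step 51: head address is not 0000
        return counter, False          # step 52: judgment flow finished
    counter += 1                       # step 53: one more pass over the chips
    return counter, counter >= limit   # step 54, leading to step 55 or 52


head = next_head_address(0xAAAA)       # preceding write ended at aaaa
```

Because the counter changes only when the head address is 0000, it counts complete passes over the address space, which is why a single counter can stand for the number of writing times of every flash memory chip 40.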

As described above, in the present embodiment, sequential
collective-writing is executed from a lower address to a
higher address in one direction. Thereby, the numbers of
writing times to all flash memory chips 40 in the flash
memory 41 are averaged.

In the present embodiment, as described above, the num- 
bers of writing times to the flash memory chips 40 in the 
flash memory 41 are averaged. 



(Embodiment 9) 

In the present embodiment, new parities are not sequentially,
collectively written to the flash memory chips for storing
parities as in the embodiment 6; instead, new
parities are written to a plurality of flash memory chips 40
for storing parities in parallel. The present embodiment is
constituted by applying the method shown in the embodiment 5
to the device shown in the embodiment 6; therefore,
a detailed explanation thereof will be omitted.

What is claimed is:

1. A method of rewriting error correcting codes in a disc 
array device which includes a plurality of disc devices which 
hold a plurality of error correcting data groups, each of 
which includes a plurality of data and at least one error 
correcting code generated therefrom, the method comprising
the steps of: 

in response to a data rewrite request provided by an upper 
level device connected to said disc array device, reading
old data designated by the data rewrite request from
said plurality of disc devices; 

reading, from said plurality of disc devices, an old error 
correcting code which belongs to one group within said 
plurality of error correcting data groups, to which one 
group the old data belongs;

generating a new error correcting code for the one error 
correcting data group, after the old data has been 
rewritten by new data designated by the data rewrite 
request, from the read old data, said old error correcting 
code and the new data;

rewriting the old data held in said plurality of disc devices 
by said new data; 

temporarily holding the generated new error correcting 
code in a random access memory provided in said disc 
array device;

repeating said step of reading old data to said step of 
temporarily holding the new error correcting code for 
each of a plurality of other data rewrite requests subsequently
provided by said upper level device, thereby
storing a group of new error correcting codes for a 
group of data rewrite requests in said memory; and 

sequentially writing the group of new error correcting 
codes held in said memory into a group of storage 
locations within said plurality of disc devices, at which
storage locations a group of old error correcting codes 
have been held, according to an order of access prede- 
termined for storage locations within the disc devices; 

wherein the order of access is predetermined so that a 
plurality of storage locations which belong to the same
track within one of said disc devices are sequentially 
accessed according to an order of locations within the 
track. 

2. A method of rewriting error correcting codes according
to claim 1, wherein the order of access is predetermined so
that a plurality of storage locations belonging to the same
cylinder of one of said disc devices are accessed sequentially
according to an order predetermined for tracks to which the
plurality of storage locations belong, and so that a plurality
of storage locations belonging to different cylinders of one
of said disc devices are accessed sequentially according to
an order predetermined for the cylinders.
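
For illustration only, the steps recited in claims 1 and 2 may be sketched as follows. The sketch assumes that the error correcting code is an exclusive-OR parity, as in the embodiments, and that the predetermined order of access is ascending cylinder, then track, then location within the track; the names Location, new_parity and ParityBuffer are hypothetical and form no part of the claims.

```python
from typing import Callable, Dict, NamedTuple


class Location(NamedTuple):
    """Storage location of a parity; tuples compare cylinder first, then
    track, then sector, which gives the assumed order of access."""
    cylinder: int
    track: int
    sector: int


def new_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Generate the new parity from the old data, the new data and the old parity."""
    return bytes(d ^ n ^ p for d, n, p in zip(old_data, new_data, old_parity))


class ParityBuffer:
    """Temporarily holds new parities in memory until they are written as a group."""

    def __init__(self) -> None:
        self._pending: Dict[Location, bytes] = {}

    def hold(self, location: Location, parity: bytes) -> None:
        # A later rewrite of the same location supersedes the held parity.
        self._pending[location] = parity

    def flush(self, write: Callable[[Location, bytes], None]) -> None:
        """Write every held parity in the order of access, then empty the buffer."""
        for location in sorted(self._pending):
            write(location, self._pending[location])
        self._pending.clear()
```

Sorting by location means that parities on the same track are written in the order of their positions within the track, so that they can be written without a seek between them.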

3. A method of rewriting error correcting codes according 
to claim 1, wherein the group of new error correcting codes
held in said memory are sequentially written into the group 
of storage locations according to an order of generation of 
corresponding ones of the group of data rewrite requests 
issued by said upper level device. 

4. A method of rewriting error correcting codes according 
to claim 1, wherein the plurality of disc devices are divided 
into a plurality of data holding disc devices and at least one 
error correcting code holding disc device; and 

wherein the reading of the group of old error correcting 
codes and the writing of the group of new error 
correcting codes are carried out in said error correcting 
code holding disc device. 

5. A method of rewriting error correcting codes according 
to claim 4, wherein the order of access is predetermined so 
that a plurality of storage locations belonging to the same 
cylinder of one of said disc devices are accessed sequentially 
according to an order predetermined for tracks to which the 
plurality of storage locations belong, and so that a plurality 
of storage locations belonging to different cylinders of one 
of said disc devices are accessed sequentially according to 
an order predetermined for the cylinders. 

6. A method of rewriting error correcting codes according 
to claim 4, wherein the group of new error correcting codes 
held in said memory are sequentially written into the group 
of storage locations according to an order of generation of 
corresponding ones of the group of data rewrite requests 
issued by said upper level device. 

7. A method of rewriting error correcting codes according 
to claim 4, 

wherein said error correcting code holding disc device 
includes a plurality of areas; 

wherein the group of new error correcting codes are held 
in said memory as a plurality of new error correcting 
code partial groups, each partial group corresponding 
to one of said areas and each partial group including 
plural new error correcting codes which correspond to 
plural old error correcting codes held at plural storage 
locations, each of which belong to the same one of said 
areas; 

wherein the step of sequentially writing the group of new 
error correcting codes is executed, so that new error 
correcting codes belonging to different error correcting 
code partial groups are sequentially written, and so that 
new error correcting codes belonging to each partial 
group are written sequentially, according to said order 
of access, into a group of storage locations which 
belong to one of said areas corresponding to each 
partial group, and at which storage locations a plurality 
of old error correcting codes corresponding to a plu- 
rality of new error correcting codes belonging to each 
partial group are held. 

8. A method of rewriting error correcting codes according 
to claim 7, wherein the order of access is predetermined so 
that a plurality of storage locations belonging to the same 
cylinder within one of said areas within said error correcting 
code holding disc device are accessed sequentially according
to an order predetermined for tracks to which the
plurality of storage locations belong, and so that a plurality 
of storage locations belonging to different cylinders within 
said one area are accessed sequentially according to an order 
predetermined for the cylinders. 

9. A method of rewriting error correcting codes according
to claim 7, wherein the step of sequentially writing the group 
of new error correcting codes is executed so that a plurality 
of new error correcting codes belonging to each partial 
group and held in said memory are sequentially written into
a group of storage locations within one of said areas corre- 
sponding to each partial area, according to an order of 
generation of corresponding ones of a group of data rewrite 
requests issued by said upper level device. 

10. A method of rewriting error correcting codes according
to claim 1, wherein the plurality of disc devices include
a plurality of error correcting code holding areas provided in 
different ones of said disc devices; 

wherein the group of new error correcting codes are held 
in said memory as a plurality of new error correcting 
code partial groups, each partial group corresponding
to one of said areas and each partial group including 
plural new error correcting codes which correspond to 
plural old error correcting codes held at plural storage 
locations, each of which belong to the same one of said 
areas;

wherein the step of sequentially writing the group of new 
error correcting codes is executed, so that new error 
correcting codes belonging to different error correcting 
code partial groups are sequentially written, and so that 
new error correcting codes belonging to each partial
group are written sequentially, according to said order 
of access, into a group of storage locations which 
belong to one of said areas corresponding to each 
partial group, and at which storage locations a plurality 
of old error correcting codes corresponding to a plurality
of new error correcting codes belonging to each
partial group are held. 

11. A method of rewriting error correcting codes accord- 
ing to claim 10, wherein the order of access is predetermined 
so that a plurality of storage locations belonging to the same 
cylinder within one of said areas within one of said disc 
devices are accessed sequentially according to an order 
predetermined for tracks to which the plurality of storage 
locations belong, and so that a plurality of storage locations 
belonging to different cylinders within said one area are
accessed sequentially according to an order predetermined 
for the cylinders. 

12. A method of rewriting error correcting codes accord- 
ing to claim 7, wherein a plurality of new error correcting 
codes belonging to each partial group and held in said
memory are sequentially written into a group of storage 
locations within one of said areas corresponding to each 
partial area, according to an order of generation of corre- 
sponding ones of a group of data rewrite requests issued by 
said upper level device. 

13. A method of rewriting error correcting codes according
to claim 1, further comprising:

reading, from said plurality of disc devices to said
memory, error correcting codes other than a group of
old error correcting codes corresponding to the group
of new error correcting codes held in said memory,
after the repeating step;

wherein the step of sequentially writing the group of new
error correcting codes is executed so that the group of
new error correcting codes and said read other error
correcting codes are sequentially written into a group of
storage locations of said plurality of disc devices
according to the order of access.

14. A method of rewriting error correcting codes according
to claim 13, wherein each of the storage locations is one
which holds either one of the group of new error correcting
codes or one of the other group of error correcting codes.

15. A method of rewriting error correcting codes according
to claim 4, further comprising:

reading, from said error correcting code holding disc 
device to said memory, error correcting codes other 
than a group of old error correcting codes correspond- 
ing to the group of new error correcting codes held in 
said memory, after the repeating step; 

wherein the step of sequentially writing the group of new 
error correcting codes is executed so that the group of 
new error correcting codes and said read other error 
correcting codes are sequentially written into a group of 
storage locations of said error correcting code holding 
disc devices according to the order of access. 

16. A method of rewriting error correcting codes according
to claim 15, wherein each of the storage locations is one
which holds either one of the group of new error correcting 
codes or one of the other group of error correcting codes. 

17. A method of rewriting error correcting codes accord- 
ing to claim 7, further comprising: 

reading, from each of the areas of said error correcting 
code holding disc device to said memory, error cor- 
recting codes other than a group of old error correcting 
codes corresponding to the partial group of new error 
correcting codes held in said memory in correspon- 
dence to said each area, after the repeating step; 

wherein the step of sequentially writing the group of new 
error correcting codes is executed so that the partial 
group of new error correcting codes and said read other 
error correcting codes held in said memory in corre- 
spondence to each area are sequentially written into 
storage locations within each area of said error correct- 
ing code holding disc device according to the order of 
access, wherein each of the storage locations within 
each area is one which holds either one of the partial 
group of new error correcting codes for each area or 
one of the other group of error correcting codes for each 
area. 

18. A method of rewriting error correcting codes accord- 
ing to claim 10, further comprising: 

reading, from each of the error correcting code holding 
areas within said plurality of disc devices to said 
memory, error correcting codes other than a group of 
old error correcting codes corresponding to the group 
of new error correcting codes held in said memory in 
correspondence to each error correcting code holding 
area, after the repeating step; 

wherein the step of sequentially writing the group of new 
error correcting codes is executed so that the group of 
new error correcting codes for each error correcting 
code holding area and said read other error correcting 
codes held in said memory in correspondence to each 
error correcting code holding area are sequentially 
written into storage locations within each error correct- 
ing code holding area, according to the order of access, 
wherein each of the storage locations within each area 
is one which holds either one of the group of new error 
correcting codes for each error correcting code holding 
area or one of the other group of error correcting codes 
for each error correcting code holding area. 

19. A method of recovering error correcting codes accord- 
ing to claim 4, further comprising the steps of: 

reading a plurality of error correcting codes from said 
error correcting code holding disc, at the occurrence of 
a fault during accessing of one of said plurality of data 
holding disc devices; 

sequentially reading a plurality of groups of data from 
others of said data holding discs, other than said one 
faulty data holding disc, each group of data comprising
data belonging to the same error correcting data group 
and being held at mutually the same address within said 
other data holding disc devices, said groups being read 
sequentially group by group;

selecting, from said plurality of error correcting codes as 
read, an error correcting code which should belong to 
the same error correcting data group as one to which 
each group within said groups of data as read belongs; 
and

recovering, for each group of data, data held in said faulty 
disc device and belonging to the same error correcting 
data group, from said error correcting code selected for 
each group and said read group of data. 

20. A method of recovering error correcting codes according
to claim 4,

wherein said disc array device further includes:

a plurality of other disc devices including a plurality of 
other data holding disc devices and at least one other 
error correcting code holding disc device for said other
data holding devices; and 

a common error correcting code holding disc device 
provided for said plurality of disc devices and said 
plurality of other disc devices;

wherein the method further comprises the steps of: 

in response to another group of data rewrite requests 
provided by said upper level device, executing the step 
of reading old data to said step of sequentially writing 
a plurality of old data and a plurality of new data, both
related to said another group of data rewrite requests 
and held in said plurality of other disc devices, thereby 
generating another group of new error correcting codes 
for said plurality of other disc devices, and sequentially 
writing the generated another group of new error correcting
codes into said other error correcting code
holding disc device; 

generating a still other group of new error correcting 
codes from the group of new error correcting codes 
generated for said plurality of disc devices and from
said another group of new error correcting codes gen- 
erated for said plurality of other disc devices; 

writing the generated still other group of error correcting 
codes into said common error correcting code holding 
disc device;

sequentially reading a group of error correcting codes 
from one of said error correcting code holding disc 
device and said other error correcting code holding disc 
device, at the occurrence of fault with another one of
said error correcting code holding disc device and said 
other error correcting code holding disc device; 

sequentially reading said still other group of error correcting
codes from said common error correcting code
holding disc device; and

recovering a group of error correcting codes held in said 
faulty error correcting code holding disc device, from 
said group of error correcting codes read from said one 
error correcting code holding device and said still other 
group of error correcting codes read from said common
error correcting code holding device. 

21. A method of rewriting error correcting codes in a disc 
array device which includes a plurality of data holding disc 
devices and holds a plurality of error correcting data groups, 
each of which includes a plurality of data and at least one
error correcting code generated therefrom, the method com- 
prising the steps of: 

in response to a data rewrite request provided by an upper
level device connected to said disc array device, read- 
ing old data designated by the data rewrite request from 
one of said plurality of disc devices; 

reading an old error correcting code which belongs to one 
group within said plurality of error correcting data 
groups to which one group the old data belongs, from 
a flash memory provided in said disc array device; 

generating a new error correcting code for the one error 
correcting data group, after the old data has been 
rewritten by new data designated by the data rewrite 
request, from the read old data, said old error correcting 
code and the new data; 

rewriting the old data held in said one disc device by said 
new data; 

temporarily holding the generated new error correcting 
code in a random access memory provided in said disc 
array device; 

repeating said step of reading old data to said step of 
holding a new error correcting code for each of a 
plurality of other data rewrite requests subsequently 
provided by said upper level device, thereby storing a 
group of new error correcting codes for a group of data 
rewrite requests into said random access memory; 

erasing a group of old error correcting codes which 
correspond to said group of new error correcting codes 
from a group of storage locations within said flash 
memory, after the repeating step; and 

sequentially writing the group of new error correcting 
codes held in said random access memory into the 
group of storage locations within said flash memory 
according to an order of access predetermined for 
storage locations within said flash memory. 

22. A method of rewriting error correcting codes accord- 
ing to claim 21, wherein the group of new error correcting 
codes held in said random access memory are sequentially 
written into the group of storage locations according to an 
order of generation of corresponding ones of the group of 
data rewrite requests issued by said upper level device. 

23. A method of rewriting error correcting codes accord- 
ing to claim 21, 

wherein said flash memory includes a plurality of areas; 

wherein the group of new error correcting codes are held 
in said random access memory as a plurality of new 
error correcting code partial groups, each partial group 
corresponding to one of said areas and each partial 
group including plural new error correcting codes 
which correspond to plural old error correcting codes 
held at plural storage locations each of which belong to 
the same one of the areas; 

wherein the step of sequentially writing the group of new 
error correcting codes is executed so that error correct- 
ing codes belonging to different error correcting code 
partial groups are written sequentially, and so that new 
error correcting codes belonging to each partial group 
are written sequentially, according to said order of 
access, into a group of storage locations which belong 
to one of said areas corresponding to each partial group 
and at which storage locations a plurality of old error 
correcting codes corresponding to a plurality of new 
error correcting codes belonging to each partial group 
are held. 

24. A method of rewriting error correcting codes
according to claim 21, further comprising: 

reading, from said flash memory to said random access 
memory, error correcting codes other than a group of 
old error correcting codes corresponding to the group
of new error correcting codes held in said random
access memory, after the repeating step; and

erasing the group of old error correcting codes and said
other error correcting codes from said flash memory,
after the reading of the latter;

wherein the step of sequentially writing the group of new
error correcting codes is executed so that the group of
new error correcting codes and said read other error
correcting codes are sequentially written into a group of
storage locations of said flash memory according to the
order of access.

25. A method of rewriting error correcting codes in a disc
array device which includes a plurality of data holding disc
devices and which holds a plurality of error correcting data
groups, each of which includes a plurality of data and at least
one error correcting code generated therefrom, the method
comprising the steps of:

in response to a data rewrite request provided by an upper
level device connected to said disc array device, reading
old data designated by the data rewrite request from
one of said plurality of devices;

reading an old error correcting code which belongs to one
group within said plurality of error correcting data
groups, to which one group the old data belongs, from
a flash memory provided in said disc array device;

generating a new error correcting code for the one error
correcting data group after the old data has been
rewritten by new data designated by the data rewrite
request, from the read old data, said old error correcting
code and the new data;

rewriting the old data held in said one disc device by said
new data;

temporarily holding the generated new error correcting 
code in a random access memory provided in said disc
array device; 

repeating said step of reading old data to said step of 
temporarily holding the generated new error correcting 
code for each of a plurality of other data rewrite 
requests subsequently provided by said upper level
device, thereby storing a group of new error correcting 
codes for a group of data rewrite requests into said 
random access memory; and 

executing an erasing operation, after the repeating step, to
a group of storage locations within said flash memory
which have successive addresses and hold invalid 
information; 

sequentially writing the group of new error correcting 
codes held in said random access memory into the 
group of storage locations according to a predetermined
order of addresses; and 

repeating the step of reading old data to the step of 
sequentially writing the group of new error correcting 
codes into said flash memory for different groups of
data rewrite requests provided by said upper level 
device; 

wherein one group within different groups of storage 
locations within said flash memory which have successive
addresses and which hold invalid information is
selected from said flash memory at the writing of a 
group of new error correcting codes generated for each 
group within the other groups of data rewrite requests 
during the last mentioned repetition. 
26. A disc array device, comprising: ^ 
a plurality of disc devices which hold a plurality of error 
correcting data groups, each of which includes a plurality
of data and at least one error correcting code
generated therefrom;
a disc array controller connected to said plurality of disc
devices and an upper level device, said disc array
controller including a random access memory and a
control device;
said control device including: 

means responsive to a data rewrite request provided by 
said upper level device for reading old data designated 
by the data rewrite request from said plurality of disc 
devices; 

means for reading, from said plurality of disc devices, an 
old error correcting code which belongs to one group 
within said plurality of error correcting data groups to 
which one group the old data belongs; 

means for generating a new error correcting code for the 
one error correcting data group, after the old data has 
been rewritten by new data designated by the data 
rewrite request, from the read old data, said old error 
correcting code and the new data; 

means for rewriting the old data held in said plurality of 
disc devices by said new data; 

means for writing the generated new error correcting code 
into the random access memory; 

means for repetitively operating said first mentioned 
reading means to said last mentioned writing means for 
each of a plurality of other data rewrite requests
subsequently provided by said upper device, thereby storing
a group of new error correcting codes for a group
of data rewrite requests in said random access memory; 
and 

means for sequentially writing the group of new error 
correcting codes held in said random access memory 
into a group of storage locations within said plurality of 
disc devices, at which storage locations a group of old 
error correcting codes have been held, according to an 
order of access predetermined for storage locations 
within said disc devices; 

wherein the order of access is predetermined so that a 
plurality of storage locations which belong to the same 
track within one of said disc devices are sequentially
accessed according to an order of locations within the 
track. 
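
The predetermined order of access recited in claim 26 can be pictured as sorting the buffered codes so that locations on the same track are visited in their order within the track. The sketch below is illustrative only: the `ParityLocation` fields (disc, track, sector) and the function `flush_order` are hypothetical, and a real controller may define the order differently.

```python
from __future__ import annotations

from typing import NamedTuple


class ParityLocation(NamedTuple):
    """Hypothetical address of one error correcting code on a disc."""
    disc: int
    track: int
    sector: int


def flush_order(
    pending: dict[ParityLocation, bytes],
) -> list[tuple[ParityLocation, bytes]]:
    """Return buffered codes sorted by disc, then track, then sector."""
    return sorted(pending.items(), key=lambda item: item[0])


# Codes buffered in random access memory in arrival order.
pending = {
    ParityLocation(0, 7, 3): b"c",
    ParityLocation(0, 2, 9): b"a",
    ParityLocation(0, 7, 1): b"b",
}
ordered = flush_order(pending)
```

Sorting by the tuple keeps the two codes on track 7 adjacent and in sector order, so they can be written in a single pass over that track.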

27. A disc array device, comprising: 

a plurality of data holding disc devices which hold a 
plurality of data belonging to a plurality of error correcting
data groups each of which includes a plurality of
data and at least one error correcting code generated 
therefrom; 

a flash memory for holding a plurality of error correcting 
codes for the plurality of error correcting data groups; 
and 

a disc array controller connected to said plurality of disc 
devices, said flash memory and an upper device, said 
disc array controller including a random access 
memory and a control device; 

said control device including: 

means responsive to a data rewrite request provided by
said upper device for reading old data designated by the 
data rewrite request from said plurality of disc devices; 

means for reading from said flash memory an old error 
correcting code which belongs to one group within said 
plurality of error correcting data groups to which one 
group the old data belongs; 
means for generating a new error correcting code for the
one error correcting data group after the old data has 
been rewritten by new data designated by the data 
rewrite request, from the read old data, said old error 
correcting code and the new data;

means for rewriting the old data held in said plurality of 
disc devices by said new data;

means for writing the generated new error correcting code 
into the random access memory; 

means for repetitively operating said first mentioned
reading means to said last mentioned writing means for each
of a plurality of other data rewrite requests
subsequently provided by said upper level device, 
thereby storing a group of new error correcting codes 
for a group of data rewrite requests in said random
access memory; 

means for erasing a group of storage locations within said 
flash memory which hold a group of old error correcting
codes corresponding to the group of new error
correcting codes held in said random access memory; 
and 

means for sequentially writing the group of new error 
correcting codes held in said random access memory 
into the group of storage locations within said flash 
memory, according to an order of access predetermined 
for storage locations within said flash memory. 
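
The last two means of claim 27 amount to erasing a group of flash locations and then writing the buffered codes into them in address order. A minimal simulation is sketched below, assuming a flash model in which a location must be erased before it can be written. The `FlashSim` class, its cell granularity and the function `flush_to_flash` are hypothetical and do not model any particular device.

```python
class FlashSim:
    """Toy flash model: a cell must be erased before it is written."""

    def __init__(self, size: int) -> None:
        # Every cell starts out holding an old error correcting code.
        self.cells: list = [b"old"] * size

    def erase(self, start: int, count: int) -> None:
        """Erase a run of successive addresses in one operation."""
        for addr in range(start, start + count):
            self.cells[addr] = None

    def write(self, addr: int, value: bytes) -> None:
        if self.cells[addr] is not None:
            raise RuntimeError("location must be erased before writing")
        self.cells[addr] = value


def flush_to_flash(flash: FlashSim, start: int, buffered: list) -> None:
    """Erase the target group, then write buffered codes in address order."""
    flash.erase(start, len(buffered))
    for offset, code in enumerate(buffered):
        flash.write(start + offset, code)
    buffered.clear()  # the random access memory copy is no longer needed


flash = FlashSim(8)
ram_buffer = [b"p0", b"p1", b"p2"]
flush_to_flash(flash, 2, ram_buffer)
```

Grouping the codes in random access memory first means one erase covers the whole group, instead of one erase per rewrite request.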

***** 